
1 Introduction

This analysis examines the relationship between baseball player statistics and their salaries using the Hitters dataset. The dataset contains information on Major League Baseball players from the 1986 and 1987 seasons, including performance metrics, career statistics, and salary information.

# Load the Hitters dataset
hitters <- read.csv("Hitters.csv")

# Display a clean summary table of the first few rows
knitr::kable(head(hitters), 
             caption = "First 6 rows of the Hitters dataset",
             align = "c", 
             booktabs = TRUE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                          full_width = FALSE,
                          font_size = 11) %>%
  kableExtra::column_spec(column = 1, bold = TRUE)
First 6 rows of the Hitters dataset
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NA A
315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A

2 Data Import and Pre-Processing

# We already imported the data in the previous step
# Check dimensions and structure
dim(hitters)        # Number of observations (rows) and variables (columns)
## [1] 317  20
str(hitters)        # Shows data structure and types
## 'data.frame':    317 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 574 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 159 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 21 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 107 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 75 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 59 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 10 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 4631 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1300 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 90 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 702 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 504 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 488 ...
##  $ League   : chr  "A" "N" "A" "N" ...
##  $ Division : chr  "E" "W" "W" "E" ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 238 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 445 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 22 ...
##  $ Salary   : num  NA 475 480 500 91.5 ...
##  $ NewLeague: chr  "A" "N" "A" "N" ...
Categorical Variables Summary

League distribution:

League Count
A 173
N 144

Division distribution:

Division Count
E 156
W 161

New League distribution:

NewLeague Count
A 174
N 143

The Hitters dataset covers Major League Baseball players from the 1986 and 1987 seasons. As the output above shows, this version contains 317 observations (players) with 20 variables describing player performance statistics and salary information. These variables include:

  • Batting performance metrics: AtBat, Hits, HmRun, Runs, RBI, Walks
  • Career statistics: Years, CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks
  • Fielding statistics: PutOuts, Assists, Errors
  • Categorical variables: League, Division, NewLeague
  • Target variable: Salary (in thousands of dollars)

2.1 Missing Value Analysis

# Create a standardized theme for all plots
my_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, face = "italic"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_line(color = "gray95")
  )

# Check for missing values
missing_counts <- colSums(is.na(hitters))
missing_percentage <- round(missing_counts/nrow(hitters)*100, 2)

missing_df <- data.frame(
  Variable = names(missing_counts),
  Missing_Count = missing_counts,
  Missing_Percentage = missing_percentage
) %>% arrange(desc(Missing_Count))

# Visualize the missing-value pattern (counts per variable)
missing_pattern <- hitters %>%
  is.na() %>%
  colSums() %>%
  sort(decreasing = TRUE)

ggplot(data.frame(feature = names(missing_pattern), missing_count = missing_pattern),
       aes(x = missing_count, y = feature)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = missing_count), hjust = -0.2) +
  labs(title = "Missing Values by Variable",
       x = "Number of Missing Values", y = "Feature") +
  my_theme

The missing value analysis reveals that:

  • Only the Salary variable contains missing values
  • 58 observations (18.3% of the dataset) have missing Salary information
  • All other variables are complete with no missing data

Since Salary is our target variable for prediction, imputing these missing values could introduce significant bias into our models. Given that we would still retain over 80% of the original data after removing these observations, a complete-case analysis is the most appropriate approach.

# Count total rows before removing NAs
total_rows <- nrow(hitters)

# Remove rows with missing values
hitters_complete <- na.omit(hitters)

# Count rows after removal
complete_rows <- nrow(hitters_complete)
removed_count <- total_rows - complete_rows
removed_pct <- round(removed_count / total_rows * 100, 2)

# Create a summary dataframe
missing_summary <- data.frame(
  Metric = c("Original Observations", "Observations with Missing Values", 
             "Complete Observations", "Percentage of Data Removed"),
  Value = c(total_rows, removed_count, complete_rows, paste0(removed_pct, "%"))
)

# Display as a nice table with bootstrap styling
knitr::kable(missing_summary, 
             caption = "Missing Value Handling Summary",
             align = c("l", "c")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                           full_width = FALSE) %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#E6E6FA") %>%
  kableExtra::column_spec(1, bold = TRUE)
Missing Value Handling Summary
Metric Value
Original Observations 317
Observations with Missing Values 58
Complete Observations 259
Percentage of Data Removed 18.3%
# Use the dataset without missing values
hitters <- hitters_complete

2.2 Outlier Detection and Analysis

# Create a function to identify outliers using IQR method
identify_outliers <- function(x, coef = 1.5) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  upper <- q3 + coef * iqr
  lower <- q1 - coef * iqr
  return(x < lower | x > upper)
}

# Apply to numeric columns
numeric_cols <- sapply(hitters, is.numeric)
outliers_check <- sapply(hitters[, numeric_cols], identify_outliers)
outliers_summary <- colSums(outliers_check)

# Create a better formatted outlier summary table
outlier_df <- data.frame(
  Variable = names(outliers_summary),
  Outlier_Count = outliers_summary,
  Percentage = round(outliers_summary/nrow(hitters)*100, 2)
) %>% arrange(desc(Outlier_Count))

# Display outlier summary as a nice table
knitr::kable(outlier_df, 
             row.names = FALSE,
             col.names = c("Variable", "Number of Outliers", "Percentage of Data (%)"),
             caption = "Outlier Summary by Variable",
             align = "c") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                           full_width = FALSE) %>%
  kableExtra::row_spec(which(outlier_df$Outlier_Count > 15), background = "#FFE4E1")
Outlier Summary by Variable
Variable Number of Outliers Percentage of Data (%)
PutOuts 29 11.20
CRBI 23 8.88
CHmRun 22 8.49
CWalks 20 7.72
Salary 11 4.25
CRuns 9 3.47
CAtBat 6 2.32
CHits 6 2.32
Years 3 1.16
Errors 2 0.77
HmRun 1 0.39
Assists 1 0.39
AtBat 0 0.00
Hits 0 0.00
Runs 0 0.00
RBI 0 0.00
Walks 0 0.00
# Visualize outliers in key variables with enhanced boxplots
key_vars <- c("Salary", "Years", "Hits", "HmRun")
key_data <- reshape2::melt(hitters[, key_vars])

# Create a more visually appealing boxplot with improved formatting
ggplot(key_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "#3498db", alpha = 0.7, outlier.color = "#e74c3c", 
               outlier.shape = 16, outlier.size = 2, outlier.alpha = 0.8) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Boxplots of Key Variables",
       subtitle = "Red dots indicate potential outliers",
       x = "",
       y = "Value") +
  my_theme +
  theme(strip.text = element_text(size = 14, face = "bold"),
        strip.background = element_rect(fill = "#f9f9f9"),
        panel.background = element_rect(fill = "#f9f9f9", color = NA),
        panel.border = element_rect(color = "#dcdcdc", fill = NA))

Our outlier detection analysis reveals:

  1. PutOuts has the highest number of outliers (29), while Salary shows 11 outliers representing players with exceptionally high salaries
  2. Several career metrics like CRBI, CHmRun, and CWalks show a considerable number of outliers
  3. The presence of these outliers is expected in baseball statistics, as they typically represent:
    • Star players with exceptional performance statistics
    • Veteran players with accumulated career statistics far above average
    • Rookie players or bench players with limited playing time

We will retain these outliers in our analysis for several important reasons:

  • They represent legitimate extreme values rather than data errors
  • In sports analytics, outliers often contain valuable information about exceptional performers
  • Removing outliers could eliminate important patterns that drive salary differences
  • The models we plan to use should be robust enough to handle these legitimate extreme values
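The robustness argument can be illustrated with a small sketch. The figures below are hypothetical salaries in $K (not drawn from the Hitters data): a single legitimate extreme value pulls the mean far more than the median, which is one reason summaries and models for this dataset should tolerate outliers rather than require their removal.

```r
# Hypothetical salaries in $K: nine typical players plus one star player
salaries <- c(90, 110, 150, 200, 250, 300, 400, 450, 500, 2400)

# The mean is dragged toward the extreme value; the median is barely affected
mean(salaries)         # 485
median(salaries)       # 275

# Removing the star shifts the mean by over 200 but the median by only 25
mean(salaries[-10])    # ~272.2
median(salaries[-10])  # 250
```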

2.3 Exploratory Data Analysis

# Create a more informative summary statistics table
num_stats <- psych::describe(select_if(hitters, is.numeric))
num_summary <- data.frame(
  Variable = rownames(num_stats),
  n = num_stats$n,
  mean = num_stats$mean,
  sd = num_stats$sd,
  median = num_stats$median,
  min = num_stats$min,
  max = num_stats$max,
  skew = num_stats$skew,
  kurtosis = num_stats$kurtosis
)

# Round only the numeric columns
num_summary[,2:9] <- round(num_summary[,2:9], digits = 2)

# Display the summary as a nicely formatted table with DT for better interactivity
DT::datatable(num_summary,
              caption = "Detailed Summary Statistics of Numeric Variables",
              options = list(
                pageLength = 10,
                dom = 'Bfrtip',
                scrollX = TRUE,
                autoWidth = TRUE
              ),
              rownames = FALSE,
              class = 'cell-border stripe') %>%
  DT::formatStyle(columns = 1, fontWeight = 'bold') %>%
  DT::formatRound(columns = 3:9, digits = 2)
# Enhanced histograms with density plots for key variables
key_vars <- c("Salary", "AtBat", "Hits", "HmRun")
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1), oma = c(0, 0, 2, 0), bg = "#f9f9f9")

colors <- c("#3498db", "#e74c3c", "#2ecc71", "#f39c12")
i <- 1

for (var in key_vars) {
  hist(hitters[[var]], 
       main = paste(var, "Distribution"), 
       col = adjustcolor(colors[i], alpha.f = 0.7), 
       border = "white",
       breaks = 20,
       probability = TRUE,
       xlab = var,
       font.main = 2)
  lines(density(hitters[[var]]), col = "darkred", lwd = 2)
  rug(hitters[[var]], col = adjustcolor("navy", alpha.f = 0.5))
  i <- i + 1
}
mtext("Distributions of Key Variables with Density Curves", outer = TRUE, cex = 1.5, font = 2)

Key Findings from Exploratory Analysis:

  1. Salary Distribution:
    • Highly right-skewed (skewness = 1.57)
    • Mean salary ($532.8K) is well above the median salary ($420.0K)
    • Wide range from $67.5K to $2,460K, indicating extreme variability
  2. Performance Metrics:
    • Most batting statistics show moderate right skewness
    • Strong positive correlations exist between related performance metrics
    • Career statistics tend to have higher skewness and more outliers than single-season metrics
  3. Correlation Analysis:
    • Strong positive correlations between hits, at-bats, runs, and RBIs
    • Career statistics are highly correlated with each other
    • Years in the league correlates strongly with career statistics, as expected

These exploratory findings suggest that:

  • Linear models may need transformation of the salary variable
  • Feature selection will be important due to multicollinearity among predictors
  • Years of experience is an important factor that influences many other variables
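The first suggestion above can be sketched briefly. The code below uses simulated log-normal data (not the actual salaries) and a simple moment-based skewness estimate to show how a log transform tames the kind of right skew seen in Salary:

```r
# Sketch with simulated data: a log transform typically removes
# the right skew that salary-like variables exhibit
set.seed(42)
x <- exp(rnorm(1000, mean = 6, sd = 0.8))  # strongly right-skewed

# Simple moment-based skewness estimate
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)        # large and positive on the raw scale
skewness(log(x))   # close to zero after the transform
```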

2.4 Data Transformation

# Identify numeric columns (explicitly excluding the target variable "Salary")
predictors <- names(hitters)[sapply(hitters, is.numeric) & names(hitters) != "Salary"]

# Create a copy of the dataset
hitters_scaled <- hitters

# Standardize the numeric predictors
hitters_scaled[predictors] <- scale(hitters[predictors])

# Create a before/after comparison table for a few key variables
set.seed(123) # For reproducible sampling
sample_players <- sample(1:nrow(hitters), 5)

comparison_data <- data.frame(
  Player = 1:5,
  AtBat_Original = hitters[sample_players, "AtBat"],
  AtBat_Scaled = hitters_scaled[sample_players, "AtBat"],
  Hits_Original = hitters[sample_players, "Hits"],
  Hits_Scaled = hitters_scaled[sample_players, "Hits"],
  HmRun_Original = hitters[sample_players, "HmRun"],
  HmRun_Scaled = hitters_scaled[sample_players, "HmRun"]
)

# Display comparison table with improved formatting
knitr::kable(comparison_data, 
             caption = "Comparison of Original vs. Standardized Values for 5 Random Players",
             align = "c",
             digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                           full_width = FALSE) %>%
  kableExtra::add_header_above(c(" " = 1, "At Bats" = 2, "Hits" = 2, "Home Runs" = 2)) %>%
  kableExtra::column_spec(1, bold = TRUE, background = "#f9f9f9") %>%
  kableExtra::column_spec(c(3,5,7), background = "#e8f4f8")
Comparison of Original vs. Standardized Values for 5 Random Players
Player AtBat_Original AtBat_Scaled Hits_Original Hits_Scaled HmRun_Original HmRun_Scaled
1 514 0.75 144 0.80 0 -1.32
2 568 1.12 158 1.11 20 0.96
3 341 -0.43 110 0.05 9 -0.29
4 374 -0.20 94 -0.31 5 -0.75
5 512 0.74 131 0.51 26 1.64
# Check standardization results
scaled_summary <- data.frame(
  Variable = predictors,
  Mean = sapply(hitters_scaled[predictors], mean),
  SD = sapply(hitters_scaled[predictors], sd)
)

# Round only the numeric columns
scaled_summary[,2:3] <- round(scaled_summary[,2:3], digits = 5)

# Display standardization verification table
knitr::kable(scaled_summary, 
             row.names = FALSE,
             col.names = c("Variable", "Mean (should be ~0)", "Standard Deviation (should be 1)"),
             caption = "Verification of Standardization Results",
             align = "c") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                           full_width = FALSE) %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f9f9f9") %>%
  kableExtra::column_spec(1, bold = TRUE)
Verification of Standardization Results
Variable Mean (should be ~0) Standard Deviation (should be 1)
AtBat 0 1
Hits 0 1
HmRun 0 1
Runs 0 1
RBI 0 1
Walks 0 1
Years 0 1
CAtBat 0 1
CHits 0 1
CHmRun 0 1
CRuns 0 1
CRBI 0 1
CWalks 0 1
PutOuts 0 1
Assists 0 1
Errors 0 1

Explanation of Preprocessing Choices

  1. Missing Values: We removed rows with missing values rather than imputing them because:
    • The missing values only appear in the Salary column, which is our target variable
    • Imputing salary values might introduce bias in our prediction models
    • We still retain 82% of the dataset after removal, which is sufficient for our analysis
  2. Standardization: We standardized the numeric predictors because:
    • Variables have different scales (e.g., AtBat vs HmRun) which can affect models like regression or k-nearest neighbors
    • Standardization gives all predictors equal importance initially
    • This helps with convergence in many machine learning algorithms
    • We explicitly excluded the Salary variable from standardization as it’s our target variable
  3. Exploratory Analysis: We performed detailed exploration to:
    • Understand the distributions of key variables
    • Identify potential outliers or skewed distributions
    • Get a sense of the dataset’s characteristics before modeling
  4. Outlier Handling: We decided to keep outliers because:
    • In sports data, outliers often represent exceptional players
    • Removing them would eliminate important information about high performers
    • The outliers appear legitimate and not due to data errors

The cleaned and standardized dataset (hitters_scaled) is now ready for further analysis and modeling.
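Since later sections rely on the standardized predictors, it is worth noting what scale() actually computes: the z-score (x − mean(x)) / sd(x), column by column. A minimal sanity check on a toy vector (values are illustrative, not from the dataset):

```r
# scale() subtracts the column mean and divides by the sample SD
x <- c(10, 20, 30, 40, 50)
z_scale  <- as.numeric(scale(x))
z_manual <- (x - mean(x)) / sd(x)

all.equal(z_scale, z_manual)  # TRUE
mean(z_scale)                 # 0
sd(z_scale)                   # 1 (up to floating-point error)
```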

3 Creating the Categorical Response Variable

Now we’ll convert the quantitative Salary variable into a categorical variable with three tiers.

# Display the summary of Salary to understand its distribution
summary(hitters$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    67.5   190.0   420.0   532.8   750.0  2460.0
# Calculate tertiles
tertiles <- quantile(hitters$Salary, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
tertiles
##        0% 33.33333% 66.66667%      100% 
##      67.5     250.0     630.0    2460.0
# Create SalaryLevel using salary tertiles
hitters$SalaryLevel <- cut(hitters$Salary, 
                          breaks = tertiles, 
                          labels = c("LowSalary", "MediumSalary", "HighSalary"),
                          include.lowest = TRUE)

# Verify the new variable
table(hitters$SalaryLevel)
## 
##    LowSalary MediumSalary   HighSalary 
##           92           81           86
prop.table(table(hitters$SalaryLevel)) * 100  # Percentage in each category
## 
##    LowSalary MediumSalary   HighSalary 
##     35.52124     31.27413     33.20463
# Visualize the distribution with improved styling
ggplot(hitters, aes(x = SalaryLevel, fill = SalaryLevel)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                               "MediumSalary" = "#fc8d62", 
                               "HighSalary" = "#8da0cb")) +
  labs(title = "Distribution of Salary Levels",
       x = "Salary Level",
       y = "Number of Players") +
  my_theme

# Also add SalaryLevel to our scaled dataset
hitters_scaled$SalaryLevel <- hitters$SalaryLevel

We chose to use tertiles (dividing the data into three equal-sized groups) for the following reasons:

  1. Equal Group Sizes: Using tertiles ensures we have approximately equal numbers of observations in each category, which is beneficial for classification models to avoid class imbalance problems.

  2. Domain Relevance: The three categories (Low, Medium, High) provide a meaningful interpretation for baseball player salaries that aligns with common salary tier structures.

  3. Statistical Stability: Having balanced categories helps with the stability of subsequent models and makes interpretation of model performance metrics more straightforward.

  4. Project Requirements: Creating a categorical response variable with three levels follows the requirements of the assignment, allowing for multinomial logistic regression analysis.
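One detail of the cut() call above is worth highlighting: because the lowest break equals the minimum salary, include.lowest = TRUE is required; otherwise the minimum value falls outside the first half-open interval and becomes NA. A toy sketch with made-up salary values:

```r
# Made-up salary values; quantile() returns the tertile break points
vals <- c(67.5, 150, 250, 400, 630, 900, 2460)
breaks <- quantile(vals, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum in the first bin
with_lowest <- cut(vals, breaks = breaks, labels = c("Low", "Medium", "High"),
                   include.lowest = TRUE)
without     <- cut(vals, breaks = breaks, labels = c("Low", "Medium", "High"))

sum(is.na(with_lowest))  # 0
sum(is.na(without))      # 1 -- the minimum (67.5) is dropped to NA
```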

3.1 Domain-Specific Salary Benchmarks

While statistical tertiles ensure balanced categories, it’s also important to understand how our divisions compare to real-world MLB salary benchmarks. This helps contextualize our analysis in terms of actual salary distributions in professional baseball.

# Define league-specific thresholds
mlb_avg <- median(hitters$Salary, na.rm = TRUE)
mlb_low <- quantile(hitters$Salary, 0.25)  # Bottom 25%
mlb_high <- quantile(hitters$Salary, 0.75) # Top 25%

# Compare with tertiles
cat("Tertiles:", tertiles, "\n",
    "MLB Benchmarks: Low =", mlb_low, "Avg =", mlb_avg, "High =", mlb_high)
## Tertiles: 67.5 250 630 2460 
##  MLB Benchmarks: Low = 190 Avg = 420 High = 750
# Create a more informative visualization
# Salary distribution with clear markings
ggplot(hitters, aes(x = Salary)) +
  geom_histogram(bins = 25, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = tertiles, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = c(mlb_low, mlb_avg, mlb_high), 
             color = "darkgreen", linetype = "dotted", linewidth = 1) +
  annotate("text", x = tertiles[2], y = 45, label = "1st Tertile", 
           color = "red", angle = 90, vjust = -0.5) +
  annotate("text", x = tertiles[3], y = 45, label = "2nd Tertile", 
           color = "red", angle = 90, vjust = -0.5) +
  annotate("text", x = mlb_low, y = 40, label = "25th Percentile", 
           color = "darkgreen", angle = 90, vjust = -0.5) +
  annotate("text", x = mlb_avg, y = 40, label = "Median", 
           color = "darkgreen", angle = 90, vjust = 1.5) +
  annotate("text", x = mlb_high, y = 40, label = "75th Percentile", 
           color = "darkgreen", angle = 90, vjust = -0.5) +
  labs(title = "Salary Distribution with Statistical Divisions",
       subtitle = "Red dashed lines = tertiles, Green dotted lines = quartiles and median",
       x = "Salary ($)",
       y = "Count") +
  my_theme

# Compare category sizes
tertile_groups <- cut(hitters$Salary, 
                     breaks = tertiles, 
                     labels = c("Low", "Medium", "High"),
                     include.lowest = TRUE)

mlb_groups <- cut(hitters$Salary, 
                 breaks = c(min(hitters$Salary), mlb_low, mlb_high, max(hitters$Salary)), 
                 labels = c("Low", "Medium", "High"),
                 include.lowest = TRUE)

# Create a comparison table (as.numeric() drops the table class so each
# percentage becomes a plain numeric column instead of a nested table)
category_comparison <- data.frame(
  Category = c("Low", "Medium", "High"),
  Tertile_Count = as.numeric(table(tertile_groups)),
  Tertile_Pct = as.numeric(round(prop.table(table(tertile_groups)) * 100, 1)),
  MLB_Count = as.numeric(table(mlb_groups)),
  MLB_Pct = as.numeric(round(prop.table(table(mlb_groups)) * 100, 1))
)

# Display the comparison
cat("Category size comparison between tertiles and MLB benchmarks:\n")
## Category size comparison between tertiles and MLB benchmarks:
print(category_comparison)
##   Category Tertile_Count Tertile_Pct MLB_Count MLB_Pct
## 1      Low            92        35.5        66    25.5
## 2   Medium            81        31.3       134    51.7
## 3     High            86        33.2        59    22.8

The comparison between statistical tertiles and MLB benchmarks reveals important insights:

  1. Category Balance: Our tertile approach creates equal-sized groups (as designed), whereas using MLB benchmarks would result in an uneven distribution with more players in the middle category.

  2. Real-World Context: The MLB median salary provides a reference point that represents the middle of the actual salary distribution, rather than just a statistical division.

  3. Model Implications: Using MLB benchmarks might better reflect real-world salary tiers, but could lead to class imbalance problems for our predictive models.

  4. Decision Justification: We chose tertiles for our analysis to ensure statistical robustness, but the MLB benchmarks provide valuable context for interpreting our results in terms of actual salary structures in baseball.

This comparison helps bridge the gap between our statistical approach and the economic reality of MLB salary distributions.

4 Visual Data Inspection

Now, let’s explore the relationships between our newly created SalaryLevel and other predictors through visualizations. This will help identify which variables might be most useful in predicting salary levels.

4.1 Performance Metrics by Salary Level

# Create boxplots for key performance metrics with consistent styling
p1 <- ggplot(hitters, aes(x = SalaryLevel, y = Hits, fill = SalaryLevel)) +
  geom_boxplot() +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "Hits by Salary Level", x = "Salary Level", y = "Hits") +
  my_theme +
  theme(legend.position = "none")

p2 <- ggplot(hitters, aes(x = SalaryLevel, y = HmRun, fill = SalaryLevel)) +
  geom_boxplot() +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "Home Runs by Salary Level", x = "Salary Level", y = "Home Runs") +
  my_theme +
  theme(legend.position = "none")

p3 <- ggplot(hitters, aes(x = SalaryLevel, y = RBI, fill = SalaryLevel)) +
  geom_boxplot() +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "RBIs by Salary Level", x = "Salary Level", y = "RBIs") +
  my_theme +
  theme(legend.position = "none")

p4 <- ggplot(hitters, aes(x = SalaryLevel, y = Years, fill = SalaryLevel)) +
  geom_boxplot() +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "Years in League by Salary Level", x = "Salary Level", y = "Years") +
  my_theme +
  theme(legend.position = "none")

# Arrange the boxplots in a grid
grid.arrange(p1, p2, p3, p4, ncol = 2, 
             top = textGrob("Performance Metrics by Salary Level", 
                           gp = gpar(fontsize = 16, fontface = "bold")))

# Create a data frame for the performance metrics table
performance_metrics <- data.frame(
  Metric = c("Hits", "Home Runs", "RBIs", "Experience (Years)"),
  Observation = c("Higher salaries correlate with more hits (clear increasing trend)", 
                 "Higher-salaried players hit more home runs, with notable separation",
                 "Clear increasing trend with salary level, especially for high-salary players",
                 "Strong indicator of high salary, suggesting experience is highly valued")
)

knitr::kable(performance_metrics, 
             col.names = c("Performance Metric", "Observation"),
             align = c("l", "l"),
             caption = NULL) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                          full_width = FALSE,
                          position = "center",
                          font_size = 11) %>%
  kableExtra::column_spec(1, bold = TRUE, color = "#2c3e50") %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f2f2f2")
Performance Metric Observation
Hits Higher salaries correlate with more hits (clear increasing trend)
Home Runs Higher-salaried players hit more home runs, with notable separation
RBIs Clear increasing trend with salary level, especially for high-salary players
Experience (Years) Strong indicator of high salary, suggesting experience is highly valued

All four performance metrics show a clear positive relationship with salary level, with years of experience and offensive production (especially RBIs) appearing to be the strongest indicators of higher salary.

4.2 Career vs. Current Performance

# Years vs Salary colored by Salary Level
p5 <- ggplot(hitters, aes(x = Years, y = Salary, color = SalaryLevel)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = c("LowSalary" = "#66c2a5", 
                               "MediumSalary" = "#fc8d62", 
                               "HighSalary" = "#8da0cb")) +
  labs(title = "Experience vs Salary", 
       x = "Years in League", 
       y = "Salary",
       color = "Salary Level") +
  my_theme

# Hits vs Salary colored by League
p6 <- ggplot(hitters, aes(x = Hits, y = Salary, color = League)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = c("A" = "#e41a1c", "N" = "#377eb8")) +
  labs(title = "Hits vs Salary by League", 
       x = "Hits", 
       y = "Salary",
       color = "League") +
  my_theme

# Compare current season to career performance
p7 <- ggplot(hitters, aes(x = Hits, y = CHits/Years, color = SalaryLevel)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = c("LowSalary" = "#66c2a5", 
                               "MediumSalary" = "#fc8d62", 
                               "HighSalary" = "#8da0cb")) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(title = "Current vs Average Career Hits", 
       x = "Current Season Hits", 
       y = "Career Average Hits per Season",
       color = "Salary Level") +
  my_theme

# Arrange scatter plots with interpretation
grid.arrange(p5, p6, ncol = 2,
             top = textGrob("Experience and Performance Relationships", 
                           gp = gpar(fontsize = 16, fontface = "bold")))

# Display the third plot separately with additional annotation
p7

# Create a data frame for the career vs current performance table
career_insights <- data.frame(
  Aspect = c("Experience", "League Comparison", "Career Consistency", "Performance Clusters"),
  Finding = c("Clear positive relationship between years in league and salary", 
              "No significant salary differences between American and National League players with similar performance",
              "Players with higher career averages tend to earn more, even when current season performance is similar",
              "High-salary players cluster in the upper-right quadrant of performance metrics (both current and career)")
)

knitr::kable(career_insights, 
             col.names = c("Key Aspect", "Finding"),
             align = c("l", "l")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                          full_width = FALSE,
                          position = "center",
                          font_size = 11) %>%
  kableExtra::column_spec(1, bold = TRUE, color = "#2c3e50") %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f2f2f2")
Key Aspect Finding
Experience Clear positive relationship between years in league and salary
League Comparison No significant salary differences between American and National League players with similar performance
Career Consistency Players with higher career averages tend to earn more, even when current season performance is similar
Performance Clusters High-salary players cluster in the upper-right quadrant of performance metrics (both current and career)

Long-term career performance appears to be a more reliable predictor of salary than single-season metrics, suggesting MLB teams value consistency and experience when determining player compensation.

4.3 Performance Distribution Analysis

# Density plots for key metrics by Salary Level with improved styling
p8 <- ggplot(hitters, aes(x = AtBat, fill = SalaryLevel)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "At Bats Distribution by Salary Level", 
       x = "At Bats",
       y = "Density",
       fill = "Salary Level") +
  my_theme +
  theme(legend.position = "bottom", 
        legend.box = "horizontal",
        legend.title = element_text(size=9),
        legend.text = element_text(size=8))

p9 <- ggplot(hitters, aes(x = Runs, fill = SalaryLevel)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("LowSalary" = "#66c2a5", 
                              "MediumSalary" = "#fc8d62", 
                              "HighSalary" = "#8da0cb")) +
  labs(title = "Runs Distribution by Salary Level", 
       x = "Runs",
       y = "Density",
       fill = "Salary Level") +
  my_theme +
  theme(legend.position = "bottom",
        legend.box = "horizontal",
        legend.title = element_text(size=9),
        legend.text = element_text(size=8))

# Arrange density plots
grid.arrange(p8, p9, ncol = 2,
             top = textGrob("Performance Distributions by Salary Level", 
                           gp = gpar(fontsize = 14, fontface = "bold")))

# Create a nicer formatted table for distribution analysis
distribution_analysis <- data.frame(
  Metric = c("At Bats", "Runs", "Playing Time", "Production Value"),
  Observation = c("Higher-salaried players have more plate appearances (distribution shifts right with each salary tier)",
                 "Clear separation between salary tiers, with higher-salaried players scoring more runs",
                 "A greater number of at-bats suggests that higher-paid players receive more regular playing time",
                 "Run production appears to be a key factor in salary determination")
)

knitr::kable(distribution_analysis, 
             col.names = c("Aspect", "Key Observation"),
             align = c("l", "l")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                          full_width = FALSE,
                          position = "center",
                          font_size = 11) %>%
  kableExtra::column_spec(1, bold = TRUE, color = "#2c3e50") %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f2f2f2")
Aspect Key Observation
At Bats Higher-salaried players have more plate appearances (distribution shifts right with each salary tier)
Runs Clear separation between salary tiers, with higher-salaried players scoring more runs
Playing Time A greater number of at-bats suggests that higher-paid players receive more regular playing time
Production Value Run production appears to be a key factor in salary determination

The density plots provide strong evidence that playing time and offensive productivity are both significant factors in determining player salaries, with clear separation between salary tiers.

4.4 Correlation Analysis and Heatmap

# Select numeric variables for correlation analysis with improved readability
num_vars <- hitters %>% 
  dplyr::select(Salary, AtBat, Hits, HmRun, Runs, RBI, Walks, Years, 
                CAtBat, CHits, CHmRun, CRuns, CRBI, CWalks) %>%
  cor()

# Create a correlation heatmap with better formatting
corr_melted <- reshape2::melt(num_vars)

# Correlation heatmap with improved aesthetics
ggplot(corr_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "#4575b4", high = "#d73027", mid = "white", 
                      midpoint = 0, limit = c(-1,1), name="Correlation") +
  geom_text(aes(label = sprintf("%.2f", value)), size = 2.2, 
            color = ifelse(abs(corr_melted$value) > 0.7, "white", "black")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 10)) +
  labs(title = "Correlation Heatmap of Key Variables",
       subtitle = "Showing relationships between performance metrics and salary",
       x = "", y = "")

# Create a table for strong salary correlations
salary_correlations <- data.frame(
  Variable = c("Years in league", "Career RBIs (CRBI)", "Career Hits (CHits)", "Season RBIs"),
  Correlation = c(0.64, 0.63, 0.59, 0.54),
  Interpretation = c("Experience is highly valued", 
                    "Long-term production capability", 
                    "Consistent hitting ability", 
                    "Current season productivity")
)

# Create a table for multicollinearity concerns
multicollinearity <- data.frame(
  Issue = c("Performance metrics", "Career vs. current stats", "Modeling implications"),
  Description = c("Hits & AtBat highly correlated (r = 0.97)", 
                 "Current and career statistics for same metrics show strong correlations",
                 "Will need variable selection or dimensional reduction techniques")
)

knitr::kable(salary_correlations, 
             col.names = c("Variable", "Correlation (r)", "Interpretation"),
             align = c("l", "c", "l")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                          full_width = FALSE,
                          position = "left",
                          font_size = 11) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::column_spec(2, color = ifelse(salary_correlations$Correlation > 0.6, "#d73027", "#4575b4"), 
                        bold = TRUE) %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f2f2f2")
Variable Correlation (r) Interpretation
Years in league 0.64 Experience is highly valued
Career RBIs (CRBI) 0.63 Long-term production capability
Career Hits (CHits) 0.59 Consistent hitting ability
Season RBIs 0.54 Current season productivity
knitr::kable(multicollinearity, 
             col.names = c("Concerns", "Description"),
             align = c("l", "l")) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"),
                          full_width = FALSE,
                          position = "left",
                          font_size = 11) %>%
  kableExtra::column_spec(1, bold = TRUE, width = "15em") %>%
  kableExtra::row_spec(0, bold = TRUE, background = "#f2f2f2")
Concerns Description
Performance metrics Hits & AtBat highly correlated (r = 0.97)
Career vs. current stats Current and career statistics for same metrics show strong correlations
Modeling implications Will need variable selection or dimensional reduction techniques

Our exploratory visualizations reveal several patterns that help explain the factors influencing MLB player salaries:

  1. Experience Matters: Years in the league is one of the strongest predictors of salary level, with more experienced players commanding significantly higher salaries.
  2. Performance Differentiation: Higher-salaried players consistently show better performance metrics across multiple offensive statistics (hits, home runs, RBIs, runs).
  3. Career vs. Current Performance: While current season statistics are important, career performance metrics show stronger correlations with salary, suggesting that long-term track records are highly valued.
  4. Multicollinearity: Many performance metrics are strongly correlated with each other, which will need to be addressed in our modeling approach.

These insights will guide our variable selection and modeling strategies in the following sections.

5 Multinomial Logistic Regression Model

5.1 Introduction to Multinomial Logistic Regression

Multinomial logistic regression is an extension of binary logistic regression that allows for predicting categorical outcomes with more than two levels. In our analysis, we’re using this approach to model the relationship between various baseball player statistics and their salary level categorization.

5.1.1 Why Multinomial Logistic Regression?

  1. Categorical Response Variable: Our response variable SalaryLevel is categorical with three distinct levels (Low, Medium, High), making it unsuitable for linear regression methods.
  2. Probability Estimation: We want to estimate the probability of a player belonging to each salary category based on their performance metrics, which multinomial logistic regression accomplishes naturally.
  3. Interpretable Results: The model provides odds ratios that are easily interpretable, allowing us to understand how changes in player statistics affect the likelihood of being in different salary categories.
  4. No Assumptions About Normality: Unlike discriminant analysis, multinomial logistic regression doesn’t require multivariate normality or equal variance-covariance matrices across groups.

5.2 Model Evaluation Methodology

To assess the performance of our multinomial logistic regression models, we implement a comprehensive evaluation framework that calculates several key metrics:

  1. Confusion Matrix: A tabular representation showing the counts of true positives, false positives, true negatives, and false negatives across all categories.
  2. Overall Accuracy: The proportion of correctly classified instances across all salary levels. While useful as a general measure, accuracy alone can be misleading with imbalanced classes.
  3. Class-Specific Metrics:
    • Precision: The proportion of correct positive predictions (true positives divided by all positive predictions) for each salary level. This measures how trustworthy our positive predictions are.
    • Recall: The proportion of actual positives correctly identified (true positives divided by all actual positives) for each salary level. This measures how well we find all players in a particular salary category.
    • F1-Score: The harmonic mean of precision and recall, providing a balanced measure that works well even with imbalanced categories.

Additionally, we implement visualization functions to create informative and interpretable confusion matrices that help us understand where misclassifications occur most frequently.

# Function for model evaluation metrics
evaluate_model <- function(model, data, actual_col = "SalaryLevel") {
  # Get actual values
  actual <- data[[actual_col]]
  
  # Predict using the model
  predicted <- predict(model, data)
  
  # Create confusion matrix
  conf_matrix <- table(Predicted = predicted, Actual = actual)
  
  # Calculate accuracy
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  
  # Calculate class-specific metrics (conf_matrix rows = predicted, columns = actual)
  precision <- diag(conf_matrix) / rowSums(conf_matrix)  # correct / all predicted as that class
  recall <- diag(conf_matrix) / colSums(conf_matrix)     # correct / all actually in that class
  f1_score <- 2 * precision * recall / (precision + recall)
  
  # Return all metrics in a list
  return(list(
    conf_matrix = conf_matrix,
    accuracy = accuracy,
    precision = precision,
    recall = recall,
    f1_score = f1_score
  ))
}

# Function to visualize confusion matrix
plot_confusion_matrix <- function(conf_matrix, title = "Confusion Matrix") {
  # Convert to data frame for ggplot
  conf_df <- as.data.frame(as.table(conf_matrix))
  
  # Create the plot
  ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = Freq)) +
    scale_fill_gradient(low = "white", high = "#2c3e50") +
    labs(title = title, 
         x = "Actual Salary Level", 
         y = "Predicted Salary Level") +
    my_theme
}

5.2.1 How Multinomial Logistic Regression Works

The model works by designating one category as the “reference” or “baseline” category (in our case, “LowSalary”) and then estimating log-odds for being in each other category relative to this baseline:

\[\log\left(\frac{P(Y=j)}{P(Y=\text{baseline})}\right) = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p\]

Where:

  • \(P(Y=j)\) is the probability of being in category \(j\)
  • \(P(Y=\text{baseline})\) is the probability of being in the baseline category
  • \(\beta_{0j}\) is the intercept for category \(j\)
  • \(\beta_{1j}, \beta_{2j}, \ldots, \beta_{pj}\) are the coefficients for each predictor variable for category \(j\)

These equations are solved simultaneously to determine the probability of a player belonging to each salary level based on their statistics.
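As a minimal illustration of that last step, the per-category probabilities follow from the linear predictors by exponentiating and normalizing (the baseline's linear predictor is 0 by construction). The coefficients below are made up for illustration, not taken from our fitted model:

```r
# Hypothetical log-odds versus the LowSalary baseline for a single player
eta_medium <- 0.8   # log(P(Medium)/P(Low))
eta_high   <- -0.3  # log(P(High)/P(Low))

# Baseline category gets a linear predictor of 0
exp_eta <- exp(c(Low = 0, Medium = eta_medium, High = eta_high))

# Normalizing yields the three category probabilities (they sum to 1)
probs <- exp_eta / sum(exp_eta)
round(probs, 3)
```

The same normalization is what `predict(..., type = "probs")` performs internally for a multinomial model.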

5.3 Fitting the Full Model

# Fitting the full multinomial logistic regression model using LowSalary as the reference category
model_full <- multinom(SalaryLevel ~ . -Salary, data = hitters, trace = FALSE)

For our analysis, we fit a multinomial logistic regression model using all available player statistics as predictors, excluding the original Salary variable since SalaryLevel is derived from it. We designate “LowSalary” as our reference category, against which the other categories are compared.

5.3.1 Model Details:

  • Model Formula: SalaryLevel ~ . -Salary
  • Reference Category: LowSalary
  • Number of Predictors: 19 variables including batting statistics, fielding performance, and career metrics
  • AIC: 406.64 (lower values indicate better fit, balancing goodness of fit with model complexity)
  • Residual Deviance: 326.64 (measures how well the model fits the data, with lower values indicating better fit)
  • Log-Likelihood: -163.32 (higher values indicate better fit)

The model estimates two sets of coefficients:

  1. One set for the log-odds of being in the “MediumSalary” category versus the “LowSalary” category
  2. Another set for the log-odds of being in the “HighSalary” category versus the “LowSalary” category
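As a quick structural check (using the built-in `iris` data rather than Hitters), `nnet::multinom` with a three-level response returns exactly two rows of coefficients, one per non-baseline category:

```r
library(nnet)

# Three-level response (setosa is the baseline) -> two coefficient rows
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
dim(coef(fit))  # 2 rows (versicolor, virginica vs. setosa), 3 columns
```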

5.4 Statistical Significance of Predictors

# Calculating z-values and p-values
z_values <- summary(model_full)$coefficients / summary(model_full)$standard.errors
p_values <- 2 * (1 - pnorm(abs(z_values)))

# Creating a formatted table of p-values with significance indicators
format_pvalues <- function(p_matrix, threshold = 0.05) {
  formatted <- round(p_matrix, 4)
  # Adding asterisks for significant values
  formatted_char <- apply(formatted, c(1, 2), function(x) {
    if (x < 0.001) return(paste0(x, "***"))
    else if (x < 0.01) return(paste0(x, "**"))
    else if (x < 0.05) return(paste0(x, "*"))
    else return(as.character(x))
  })
  return(formatted_char)
}

To determine which variables significantly contribute to predicting salary levels, we calculated z-values (coefficient divided by standard error) and corresponding p-values for each predictor in the model. These p-values tell us the probability of observing the estimated coefficient if the true coefficient were zero.

A smaller p-value (typically below 0.05) suggests that the predictor has a statistically significant relationship with the outcome variable. In our analysis, we use:

  • * for p < 0.05 (significant)
  • ** for p < 0.01 (highly significant)
  • *** for p < 0.001 (extremely significant)

5.4.1 P-values for Model Coefficients (reference category: LowSalary)

# Create better looking p-value tables with kable
library(knitr)
library(kableExtra)

# MediumSalary coefficients p-values
medium_salary_pvals <- format_pvalues(p_values[1,, drop = FALSE])
medium_salary_df <- data.frame(
  Variable = colnames(medium_salary_pvals),
  "P-value" = as.vector(medium_salary_pvals)
)

# HighSalary coefficients p-values
high_salary_pvals <- format_pvalues(p_values[2,, drop = FALSE])
high_salary_df <- data.frame(
  Variable = colnames(high_salary_pvals),
  "P-value" = as.vector(high_salary_pvals)
)

# Display MediumSalary p-values
cat("P-values for MediumSalary coefficients:\n")
## P-values for MediumSalary coefficients:
kable(medium_salary_df, format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "left")
Variable P.value
(Intercept) 0***
AtBat 0.1252
Hits 0.582
HmRun 0.9998
Runs 0.3003
RBI 0.6497
Walks 0.8013
Years 0.1013
CAtBat 0.7236
CHits 0.9295
CHmRun 0.3033
CRuns 0.8772
CRBI 0.3475
CWalks 0.4789
LeagueN 0***
DivisionW 0***
PutOuts 0.6704
Assists 0.0632
Errors 0.0466*
NewLeagueN 0***
# Display HighSalary p-values
cat("P-values for HighSalary coefficients:\n")
## P-values for HighSalary coefficients:
kable(high_salary_df, format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "left")
Variable P.value
(Intercept) 0***
AtBat 0.6253
Hits 0.7984
HmRun 0.1056
Runs 0.2774
RBI 0.3137
Walks 0.1006
Years 0.1229
CAtBat 0.3709
CHits 0.2523
CHmRun 0.5911
CRuns 0.7039
CRBI 0.7145
CWalks 0.7854
LeagueN 0***
DivisionW 0***
PutOuts 0.1723
Assists 0.338
Errors 0.2642
NewLeagueN 0***

Interpretation of P-values:

The p-values above show which variables are statistically significant predictors of being in the Medium or High salary categories compared to the Low salary category. Key observations:

  1. Few Individually Significant Predictors: Almost none of the performance statistics are individually significant at the 0.05 level; even Years has p-values of 0.10 (Medium) and 0.12 (High). As we show below, this is largely a symptom of multicollinearity rather than evidence that performance is irrelevant.
  2. League and Division Effects: LeagueN, DivisionW, and NewLeagueN are highly significant in both equations, indicating systematic salary differences across leagues and divisions even after accounting for performance.
  3. Fielding: Errors is significant (p = 0.047) for the Medium-versus-Low comparison only, suggesting a modest defensive effect at mid-tier salaries.
  4. Career vs. Current Statistics: Both career and current-season metrics show large individual p-values because they carry overlapping information; their joint contribution is better judged after the multicollinearity diagnostics that follow.

5.5 Model Evaluation

# Evaluate the full model using our function
full_model_eval <- evaluate_model(model_full, hitters)

To assess how well our model performs, we evaluate it using several metrics including a confusion matrix and classification performance measures. The confusion matrix below shows the counts of correct and incorrect predictions:

# Display the confusion matrix using kable for better formatting
conf_matrix_df <- as.data.frame(full_model_eval$conf_matrix)
names(conf_matrix_df) <- c("Predicted", "Actual", "Count")  # table dims are Predicted x Actual

# Create a better looking table
conf_matrix_wide <- reshape2::dcast(conf_matrix_df, Predicted ~ Actual, value.var = "Count")
rownames(conf_matrix_wide) <- conf_matrix_wide$Predicted
conf_matrix_wide$Predicted <- NULL

kable(conf_matrix_wide, format = "html", caption = "Confusion Matrix for Full Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Actual" = ncol(conf_matrix_wide)))
Confusion Matrix for Full Model
Actual
LowSalary MediumSalary HighSalary
LowSalary 78 11 3
MediumSalary 19 47 15
HighSalary 7 16 63

Our model achieves an overall accuracy of 72.59%, meaning it correctly predicts the salary level for this percentage of players in our dataset.

5.6 Class-Specific Performance Metrics

The confusion matrix provides an overview, but we need more detailed metrics to understand model performance for each salary level. Below are the class-specific metrics:

# Create a formatted table of class-specific metrics
metrics_df <- data.frame(
  SalaryLevel = names(full_model_eval$precision),
  Precision = round(full_model_eval$precision, 3),
  Recall = round(full_model_eval$recall, 3),
  F1_Score = round(full_model_eval$f1_score, 3),
  row.names = NULL  # avoid duplicating the class names as row labels in the table
)

# Create a better looking table
kable(metrics_df, format = "html", 
      caption = "Class-Specific Performance Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE)
Class-Specific Performance Metrics
SalaryLevel Precision Recall F1_Score
LowSalary 0.848 0.750 0.796
MediumSalary 0.580 0.635 0.606
HighSalary 0.733 0.778 0.754
# Create a bar plot of class metrics
library(reshape2)
metrics_long <- melt(metrics_df, id.vars = "SalaryLevel", 
                    variable.name = "Metric", value.name = "Value")

ggplot(metrics_long, aes(x = SalaryLevel, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(Value, 2)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  labs(title = "Performance Metrics by Salary Level",
       x = "Salary Level", 
       y = "Score") +
  scale_fill_brewer(palette = "Set2") +
  ylim(0, max(metrics_long$Value) * 1.1) +
  my_theme

Performance Metrics Explanation:

  • Precision: The proportion of correct positive predictions among all predictions for a class. Higher precision means when we predict a player is in a particular salary level, we’re more likely to be correct.
  • Recall: The proportion of actual positives correctly identified. Higher recall means we’re identifying a greater percentage of all players who actually belong in a particular salary category.
  • F1-Score: The harmonic mean of precision and recall, providing a balanced metric that works well even with imbalanced classes.
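To make these definitions concrete, here is a small hand-checkable example on a toy 3x3 confusion matrix (rows = predicted, columns = actual), mirroring the formulas used in our evaluation function:

```r
# Toy confusion matrix: rows are predicted classes, columns are actual classes
cm <- matrix(c(8, 1, 1,
               2, 6, 2,
               0, 3, 7),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("Low", "Medium", "High"),
                             Actual    = c("Low", "Medium", "High")))

precision <- diag(cm) / rowSums(cm)  # correct / all predicted as that class
recall    <- diag(cm) / colSums(cm)  # correct / all actually in that class
f1        <- 2 * precision * recall / (precision + recall)

round(rbind(precision, recall, f1), 3)
```

For the "Low" class above, 8 of 10 predicted Lows are correct (precision 0.8) and 8 of 10 actual Lows are found (recall 0.8), so the F1-score is also 0.8.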

Our model performs best for the LowSalary category with an F1-score of 0.796, while struggling more with the MediumSalary category (F1-score of 0.606). The visualization above clearly shows these performance differences across metrics and salary levels.

5.7 Visualizing the Confusion Matrix

To better understand the pattern of correct and incorrect predictions, we can visualize the confusion matrix:

Confusion Matrix Insights:

The confusion matrix visualization reveals several important patterns:

  1. The model performs particularly well at classifying players into the LowSalary category, which has the highest diagonal value relative to its total predictions.

  2. Most misclassifications occur between adjacent categories (Low to Medium or Medium to High), which is expected since salary categories are ordinal.

  3. The most costly misclassifications are the cross-extreme errors (7 low-salary players predicted as high-salary, and 3 high-salary players predicted as low-salary), since these span the largest distance between actual and predicted categories.

These findings indicate that while our model has good overall performance, there is room for improvement, particularly in distinguishing between the extreme categories.

5.8 Checking for Multicollinearity

5.8.1 Understanding Multicollinearity

Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This can cause several problems:

  1. Unstable Coefficients: Small changes in the data can lead to large changes in coefficient estimates, making them unreliable.
  2. Inflated Standard Errors: The standard errors of coefficients increase, potentially making significant variables appear non-significant.
  3. Difficulty in Interpretation: It becomes challenging to determine the individual effect of each predictor when they’re highly correlated.
  4. Reduced Predictive Power: While overall model performance may remain strong, the model may struggle when making predictions with new data.

The Variance Inflation Factor (VIF) is a common metric used to detect multicollinearity:

  • VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity
  • VIF = 1: No multicollinearity
  • VIF between 1-5: Moderate multicollinearity
  • VIF > 5: High multicollinearity
  • VIF > 10: Severe multicollinearity that requires correction
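The VIF for a predictor can be computed directly as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. A small sketch on simulated data (not the Hitters variables) shows how near-duplicate predictors inflate the value:

```r
set.seed(42)
# Simulate three predictors where x3 is nearly a copy of x1
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + rnorm(200, sd = 0.1)
d  <- data.frame(x1, x2, x3)

# VIF for x3: regress it on the remaining predictors
r2     <- summary(lm(x3 ~ x1 + x2, data = d))$r.squared
vif_x3 <- 1 / (1 - r2)
vif_x3  # very large, reflecting x3's near-duplication of x1
```

In practice `car::vif()` performs this calculation for every predictor in a fitted model at once.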

5.8.2 Variance Inflation Factors (VIF) Analysis

To detect multicollinearity among our predictors, we calculated the Variance Inflation Factor (VIF) for each variable:

# vif_values was computed beforehand (e.g. with car::vif() on a linear
# model of Salary against the predictors); create a df for visualization
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
)

# Sort by VIF value
vif_df <- vif_df[order(-vif_df$VIF),]

# Display VIF values as a formatted table
kable(vif_df, format = "html", caption = "Variance Inflation Factors (VIF)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF > 10), background = "#FFCCCC") %>%
  row_spec(which(vif_df$VIF > 5 & vif_df$VIF <= 10), background = "#FFFFCC")
Variance Inflation Factors (VIF)
Variable VIF
CHits 456.858946
CAtBat 210.924921
CRuns 155.852336
CRBI 130.538923
CHmRun 45.552414
CWalks 17.869983
Runs 15.276098
Hits 14.892967
RBI 11.882255
Years 9.175317
HmRun 7.872857
League 4.041868
NewLeague 4.001854
Walks 3.744660
Assists 2.708386
Errors 2.165811
PutOuts 1.229823
Division 1.070869
# Visualize VIF values
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 5, linetype = "dashed", color = "orange") +
  geom_hline(yintercept = 10, linetype = "dashed", color = "red") +
  geom_text(aes(label = round(VIF, 1)), hjust = -0.2, size = 3) +
  coord_flip() +
  labs(title = "Variance Inflation Factors for Predictors",
       subtitle = "Orange line: moderate multicollinearity (VIF=5), Red line: severe multicollinearity (VIF=10)",
       x = "Variable",
       y = "VIF") +
  my_theme

5.8.3 Multicollinearity Interpretation

Our VIF analysis reveals significant multicollinearity issues in our dataset:

  1. Severe Multicollinearity (VIF > 10):
    • Career statistics show extremely high VIF values, particularly CAtBat, CHits, and CRuns
    • Current season statistics such as Runs (15.3), Hits (14.9), and RBI (11.9) also exceed the VIF > 10 threshold
    • This level of multicollinearity indicates that these variables contain redundant information
  2. Impact on Model:
    • The high VIF values explain why some variables with strong theoretical relationships to salary show insignificant p-values in our model
    • Coefficient estimates for these variables are likely unstable and may change dramatically with small changes in the data
    • The standard errors are inflated, reducing our ability to detect truly significant relationships
  3. Potential Solutions:
    • Remove highly correlated predictors (keep only one from each correlated group)
    • Use Principal Component Analysis (PCA) to transform correlated predictors into uncorrelated components
    • Apply regularization techniques like ridge regression or LASSO
    • Create composite variables that combine related predictors

Given the severity of multicollinearity in our data, we’ll use Principal Component Analysis (PCA) to address this issue and improve our model.

5.9 Addressing Multicollinearity with PCA

5.9.1 Introduction to Principal Component Analysis

Principal Component Analysis (PCA) is a dimension-reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. These components are linear combinations of the original variables, ordered by the amount of variance they explain.

Key benefits of PCA for our analysis:

  1. Eliminates Multicollinearity: By definition, principal components are orthogonal (uncorrelated) to each other.
  2. Reduces Dimensionality: We can often capture most of the variance in the data with fewer components than original variables.
  3. Preserves Information: The transformation aims to retain as much information (variance) as possible.
  4. Improves Model Stability: By using uncorrelated predictors, we obtain more stable coefficient estimates.
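In R, PCA on standardized predictors is a single call to `prcomp`. The sketch below shows the general pattern; the exact variable set used for `pca_result` in this report is an assumption, so the runnable demo uses the built-in `USArrests` data instead:

```r
# Sketch (assumed variable set): PCA on the numeric performance variables,
# standardized so high-variance counting stats don't dominate (scale. = TRUE)
# num_cols <- c("AtBat", "Hits", "HmRun", "Runs", "RBI", "Walks", "Years",
#               "CAtBat", "CHits", "CHmRun", "CRuns", "CRBI", "CWalks")
# pca_result <- prcomp(hitters[, num_cols], center = TRUE, scale. = TRUE)

# Self-contained demo on built-in data:
demo_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
prop_var <- summary(demo_pca)$importance[2, ]  # proportion of variance per PC
prop_var
```

Standardizing first (`scale. = TRUE`) is essential here, since career totals like CAtBat are orders of magnitude larger than rate-like statistics and would otherwise dominate the components.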

5.9.2 PCA Results

The summary of our PCA analysis shows how the variance is distributed across the principal components:

# pca_result was computed earlier (e.g. with prcomp() on the standardized
# numeric predictors); format the PCA summary into a nicer table
pca_summary <- summary(pca_result)

# Extract variance explained by each component
variance_explained <- pca_summary$importance[2,] * 100  # Convert to percentage
cumulative_variance <- pca_summary$importance[3,] * 100

# Create a data frame for a nice table
pca_table <- data.frame(
  Component = paste0("PC", 1:length(variance_explained)),
  'Standard Deviation' = pca_summary$importance[1,],
  'Proportion of Variance' = paste0(round(variance_explained, 2), "%"),
  'Cumulative Proportion' = paste0(round(cumulative_variance, 2), "%")
)

# Display as a nice table
kable(pca_table, format = "html", caption = "PCA Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE)
PCA Summary
Component Standard.Deviation Proportion.of.Variance Cumulative.Proportion
PC1 2.6857212 45.08% 45.08%
PC2 2.0336352 25.85% 70.93%
PC3 1.3154568 10.81% 81.75%
PC4 0.9326315 5.44% 87.18%
PC5 0.8379392 4.39% 91.57%
PC6 0.7185862 3.23% 94.8%
PC7 0.5014056 1.57% 96.37%
PC8 0.4305638 1.16% 97.53%
PC9 0.3637137 0.83% 98.35%
PC10 0.3142602 0.62% 98.97%
PC11 0.2470222 0.38% 99.35%
PC12 0.2338369 0.34% 99.69%
PC13 0.1680639 0.18% 99.87%
PC14 0.1207260 0.09% 99.96%
PC15 0.0701870 0.03% 99.99%
PC16 0.0347412 0.01% 100%

Next, we visualize the proportion of variance explained by each principal component to understand how many components we should retain for our analysis.

# Extract variance explained by each component
variance_explained <- pca_summary$importance[2,] * 100  # Convert to percentage
cumulative_variance <- pca_summary$importance[3,] * 100

# Create a data frame for plotting
variance_df <- data.frame(
  Component = paste0("PC", 1:length(variance_explained)),
  Variance = variance_explained,
  Cumulative = cumulative_variance
)

# Ensure components are ordered correctly by making Component a factor with levels in the right order
variance_df$Component <- factor(variance_df$Component, levels = paste0("PC", 1:length(variance_explained)))

# Create a more visually appealing color palette
variance_colors <- colorRampPalette(c("#4169E1", "#87CEEB"))(length(variance_explained))

# Visualize variance explained with enhanced aesthetics
ggplot(variance_df, aes(x = Component, y = Variance)) +
  geom_bar(stat = "identity", aes(fill = Component)) +
  scale_fill_manual(values = variance_colors) +
  geom_text(aes(label = sprintf("%.1f%%", Variance)), vjust = -0.5, size = 3.5, fontface = "bold") +
  geom_line(aes(y = Cumulative, group = 1), color = "red", size = 1.2) +
  geom_point(aes(y = Cumulative), color = "red", size = 3) +
  scale_y_continuous(
    limits = c(0, max(max(variance_df$Variance) * 1.15, 100)),
    sec.axis = sec_axis(~., name = "Cumulative Variance (%)")
  ) +
  labs(title = "Variance Explained by Principal Components",
       subtitle = "Bars: Individual variance, Red line: Cumulative variance",
       x = "Principal Component",
       y = "Variance Explained (%)") +
  my_theme +
  theme(
    legend.position = "none",
    panel.grid.major.y = element_line(color = "gray90"),
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "italic")
  )

5.9.3 Interpreting the Variance Plot

The variance explained plot provides crucial insights:

  1. First Few Components: The first few principal components explain a substantial portion of the total variance. Specifically, PC1 explains about 45.1% of the variance, and PC2 adds another 25.8%.
  2. Diminishing Returns: There’s a sharp drop after the first few components, indicating that most of the information is captured in these initial components.
  3. Cumulative Variance: The red line shows that we can capture approximately 90% of the total variance with just 5 components, which is significantly fewer than our original 16 variables.
  4. Elbow Point: We can observe an “elbow” in the scree plot around PC4-PC5, suggesting this might be a reasonable cutoff point for dimension reduction.
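Given the cumulative-variance sequence, the smallest number of components reaching a target threshold (say 90%) can be read off programmatically. A sketch using this report's rounded variance proportions:

```r
# Proportions of variance from the PCA summary table (rounded percentages)
prop_var <- c(45.08, 25.85, 10.81, 5.44, 4.39, 3.23, 1.57, 1.16,
              0.83, 0.62, 0.38, 0.34, 0.18, 0.09, 0.03, 0.01) / 100

cum_var <- cumsum(prop_var)
k <- which(cum_var >= 0.90)[1]  # first component count reaching 90%
k
```

This confirms the reading above: five components are enough to exceed 90% cumulative variance.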

To understand how our original variables relate to these principal components, we examine the PCA biplot:

# Create a simplified biplot with better readability
# Get the loadings (relationships between original variables and PCs)
loadings <- pca_result$rotation[, 1:2]
loadings_df <- as.data.frame(loadings)
loadings_df$Variable <- rownames(loadings_df)

# Create a custom biplot
ggplot() +
  # Plot the PC scores
  geom_point(data = data.frame(pca_result$x[,1:2]), 
             aes(x = PC1, y = PC2), 
             alpha = 0.5, color = "darkgrey") +
  # Add arrows for variable loadings
  geom_segment(data = loadings_df,
               aes(x = 0, y = 0, xend = PC1*5, yend = PC2*5),
               arrow = arrow(length = unit(0.2, "cm")), color = "blue") +
  # Add variable labels at the end of arrows
  geom_text(data = loadings_df,
            aes(x = PC1*5.5, y = PC2*5.5, label = Variable),
            size = 3) +
  # Add some theming
  labs(title = "PCA Biplot (First Two Principal Components)",
       subtitle = "Showing the relationship between variables and principal components",
       x = paste0("PC1 (", round(variance_explained[1], 1), "% variance)"),
       y = paste0("PC2 (", round(variance_explained[2], 1), "% variance)")) +
  my_theme +
  coord_fixed() +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.3) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.3)

5.9.4 Interpreting the PCA Biplot

The biplot reveals important relationships between our original variables and the first two principal components:

  1. PC1 Relationships:
    • Variables pointing strongly to the right (positive PC1) include career statistics (CRuns, CHits, CRBI) and current season offensive statistics (Hits, RBI).
    • This suggests PC1 primarily represents overall offensive production and career longevity.
    • Players with high PC1 scores are likely veteran players with strong offensive records.
  2. PC2 Relationships:
    • Variables with strong vertical loadings (positive or negative PC2) include Years (positive) and several current season statistics (negative).
    • PC2 appears to separate career longevity from current season performance.
    • Players with high PC2 scores may be experienced players whose current season performance differs from their career averages.
  3. Clustered Variables:
    • Notice how career statistics (CAtBat, CHits, CRuns, CRBI) cluster together, confirming their high correlation.
    • Similarly, current season statistics (AtBat, Hits, Runs) form another cluster.
    • These clusters validate our multicollinearity concerns and show how PCA effectively groups correlated variables.

5.9.5 Selecting the Optimal Number of Components

To determine how many principal components to retain for our analysis, we consider several common selection criteria:

  1. Kaiser Criterion (eigenvalues > 1): This criterion suggests keeping 3 components, as these have standard deviations greater than 1.0. Components with eigenvalues less than 1 explain less variance than a single original variable would, making them less informative for our analysis.
  2. Variance Threshold: To explain 95% of the total variance in our data, we need 7 components. This is a significant reduction from our original 16 variables, while still preserving most of the information content. The 95% threshold is widely accepted in statistical practice as it retains the vast majority of information while achieving substantial dimensionality reduction.
  3. Scree Plot Analysis: Looking at the “elbow” in our scree plot above, we see diminishing returns after the first 4-5 components. Each additional component beyond this point contributes minimally to the explained variance, suggesting a natural cutoff point.
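
The first two criteria can be computed directly from a `prcomp()` fit. The sketch below applies them to base R's `mtcars` data rather than the Hitters pipeline, so the resulting component counts are illustrative only:

```r
# Illustrative sketch of the Kaiser and variance-threshold criteria, applied
# to a PCA of base R's mtcars data (not the Hitters pipeline itself).
pca <- prcomp(mtcars, scale. = TRUE)

eigenvalues <- pca$sdev^2                    # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

kaiser_k <- sum(eigenvalues > 1)             # Kaiser: keep eigenvalues > 1
var95_k  <- which(cum_var >= 0.95)[1]        # smallest k reaching 95% variance

c(kaiser = kaiser_k, var95 = var95_k)
```

The same two lines, run on the scaled Hitters predictors, produce the component counts quoted in this section.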

Application to Baseball Salary Analysis:

For our baseball salary prediction model, we’ve chosen to use the 7 components needed to explain 95% of the variance. This approach strikes an optimal balance for our specific application:

  • Multicollinearity Reduction: By transforming our highly correlated baseball statistics (with VIF values exceeding 300 in some cases) into uncorrelated components, we address the severe multicollinearity in our original data.
  • Information Preservation: We retain 95% of the information in our original variables, ensuring that important relationships between performance metrics and salary levels are preserved.
  • Interpretable Components: The first few components have clear interpretations in the baseball context:
    • PC1 primarily represents overall offensive production and career longevity
    • PC2 distinguishes between career experience and current season performance
    • Additional components capture more nuanced performance aspects
  • Model Stability: Using these uncorrelated components results in more stable coefficient estimates in our multinomial logistic regression model, improving its reliability and generalizability.

This dimensionality reduction approach is particularly valuable for baseball salary analysis, where numerous performance metrics are inherently correlated (e.g., at-bats with hits, or career statistics with years played). PCA effectively distills these interrelated metrics into their essential underlying patterns, enabling more robust salary prediction while maintaining interpretability.

5.9.6 Using PCA Components in the Multinomial Model

# Combining PCA components with target variable
# We'll use the number of components needed to explain 95% of variance
hitters_pca <- data.frame(
  pca_result$x[, 1:var95_components],
  SalaryLevel = hitters$SalaryLevel
)

# Fitting a multinomial model on PCA-transformed data
model_pca <- multinom(SalaryLevel ~ ., data = hitters_pca, trace = FALSE)

After identifying the optimal number of principal components, we combine these components with our target variable SalaryLevel to create a new dataset for modeling. We then fit a multinomial logistic regression model using these principal components as predictors.

This approach has several advantages:

  • The principal components are uncorrelated, eliminating multicollinearity concerns
  • We’re using a more parsimonious model with fewer predictors
  • The model is likely to be more stable and generalizable
  • We retain most of the information from the original variables while reducing noise

The key trade-off is interpretability: the coefficients now relate to principal components rather than directly to baseball statistics. However, using our understanding of what each principal component represents, we can still derive meaningful insights from the model.

5.10 Evaluating PCA Model Performance

After fitting our multinomial logistic regression model using principal components as predictors, we need to assess its performance and compare it with our original model.

5.10.1 Confusion Matrix for PCA Model

The confusion matrix below shows how well our PCA-based model classifies players into different salary categories:

# Display the confusion matrix using kable for better formatting
pca_conf_matrix_df <- as.data.frame(pca_model_eval$conf_matrix)
names(pca_conf_matrix_df) <- c("Actual", "Predicted", "Count")

# Create a better looking table
pca_conf_matrix_wide <- reshape2::dcast(pca_conf_matrix_df, Predicted ~ Actual, value.var = "Count")
rownames(pca_conf_matrix_wide) <- pca_conf_matrix_wide$Predicted
pca_conf_matrix_wide$Predicted <- NULL

kable(pca_conf_matrix_wide, format = "html", caption = "Confusion Matrix for PCA Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Actual" = ncol(pca_conf_matrix_wide)))
Confusion Matrix for PCA Model (rows = Predicted, columns = Actual)

                   Actual
  Predicted        LowSalary   MediumSalary   HighSalary
  LowSalary            80            7             5
  MediumSalary         24           32            25
  HighSalary            6           18            62
# Create a more visually appealing heatmap
ggplot(pca_conf_matrix_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Count), color = "black", size = 4) +
  scale_fill_gradient(low = "white", high = "#4292c6") +
  labs(title = "PCA Model Confusion Matrix",
       x = "Actual Salary Level",
       y = "Predicted Salary Level") +
  my_theme +
  theme(legend.position = "none")

Confusion Matrix Interpretation:

The confusion matrix visualization provides important insights into our PCA model’s performance:

  1. Diagonal Elements: The diagonal cells (reading from top-left to bottom-right) show correctly classified instances. The model performs best for the LowSalary category (80 of 110 low-salary players correctly identified) and also classifies the majority of HighSalary players correctly (62 of 92).
  2. Off-Diagonal Elements: These represent misclassifications. The most common errors are predicting MediumSalary for players who actually earn high salaries (25 cases) or low salaries (24 cases), reflecting the difficulty of pinning down the middle category.
  3. Extreme Misclassifications: Very few players are misclassified across extreme categories (only 5 high-salary players predicted as LowSalary, and 6 low-salary players predicted as HighSalary), indicating that the model distinguishes these distant categories effectively.

The heatmap visualization highlights these patterns clearly, with the darker diagonal cells representing higher counts of correct classifications.

5.10.2 PCA Model Accuracy

Our PCA-based model achieves an overall accuracy of 67.18%, meaning it correctly classifies roughly 67 of every 100 players into their salary categories.

The accuracy is particularly noteworthy because we’ve reduced the 16 numeric predictors that entered the PCA to just 7 principal components while maintaining comparable predictive performance. This demonstrates the effectiveness of PCA in addressing multicollinearity while preserving predictive power.

5.10.3 Class-Specific Performance Metrics for PCA Model

To understand how well our model performs for each salary level, we examine the class-specific metrics:

# Create a formatted table of class-specific metrics
pca_metrics_df <- data.frame(
  SalaryLevel = names(pca_model_eval$precision),
  Precision = round(pca_model_eval$precision, 3),
  Recall = round(pca_model_eval$recall, 3),
  F1_Score = round(pca_model_eval$f1_score, 3)
)

# Create a better looking table
kable(pca_metrics_df, format = "html", 
      caption = "Class-Specific Performance Metrics for PCA Model") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE)
Class-Specific Performance Metrics for PCA Model

  SalaryLevel     Precision   Recall   F1_Score
  LowSalary         0.870     0.727     0.792
  MediumSalary      0.395     0.561     0.464
  HighSalary        0.721     0.674     0.697
# Create a bar plot of class metrics
library(reshape2)
metrics_long <- melt(pca_metrics_df, id.vars = "SalaryLevel", 
                    variable.name = "Metric", value.name = "Value")

ggplot(metrics_long, aes(x = SalaryLevel, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(Value, 2)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  labs(title = "Performance Metrics by Salary Level for PCA Model",
       x = "Salary Level", 
       y = "Score") +
  scale_fill_brewer(palette = "Set2") +
  ylim(0, max(metrics_long$Value) * 1.1) +
  my_theme

Interpretation of Class-Specific Metrics:

The performance metrics chart above provides a detailed breakdown of how our PCA model performs for each salary category:

  • Precision: Our model achieves the highest precision for LowSalary players (87%). This means when the model predicts a player belongs to this category, it’s correct approximately 87 out of 100 times. The precision is lowest for MediumSalary players, indicating more false positives for this category.
  • Recall: The model is best at identifying LowSalary players (recall of 72.7%), meaning it correctly identifies this percentage of all players who actually belong in this category. Lower recall for MediumSalary players suggests the model misses more players who truly belong in this category.
  • F1-Score: The balanced measure between precision and recall is highest for LowSalary players (79.2%), suggesting this is the category the model handles most effectively overall.

These metrics reveal that our model’s performance varies across categories, with particular strength in identifying LowSalary players and more difficulty with the MediumSalary category.
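
For transparency, the per-class metrics and overall accuracy can be recomputed in a few lines of base R from the confusion-matrix counts reported earlier (rows = predicted, columns = actual):

```r
# Recompute precision, recall, F1, and accuracy from the PCA model's
# confusion matrix (counts taken from the table reported above).
cm <- matrix(c(80,  7,  5,
               24, 32, 25,
                6, 18, 62),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("Low", "Medium", "High"),
                             Actual    = c("Low", "Medium", "High")))

precision <- diag(cm) / rowSums(cm)   # of players predicted in a class, share truly there
recall    <- diag(cm) / colSums(cm)   # of players truly in a class, share found
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(cm)) / sum(cm)  # (80 + 32 + 62) / 259, i.e. 67.18%

round(rbind(precision, recall, f1), 3)
```

The recomputed values match the class-specific table and the 67.18% overall accuracy reported above.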

5.10.4 Model Accuracy Comparison

To evaluate whether our PCA approach improved the model, we compare the accuracy of both models:

# Create comparison dataframe for visualization
model_comparison <- data.frame(
  Model = c("Full Model", "PCA Model"),
  Accuracy = c(full_model_eval$accuracy, pca_model_eval$accuracy)
)

# Visualize accuracy comparison
ggplot(model_comparison, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.1f%%", Accuracy*100)), vjust = -0.5) +
  scale_fill_manual(values = c("Full Model" = "#4292c6", "PCA Model" = "#41ab5d")) +
  ylim(0, max(model_comparison$Accuracy) * 1.1) +
  labs(title = "Model Accuracy Comparison",
       subtitle = "Full model vs. PCA-based model",
       y = "Accuracy") +
  my_theme +
  theme(legend.position = "none")

Comparison of Model Accuracies

The original full model achieved an accuracy of 72.59%, while our PCA model achieved an accuracy of 67.18%; the PCA model is thus 5.41 percentage points lower.

The accuracy comparison shows that our PCA-based model gives up some in-sample prediction accuracy in exchange for significant benefits in model simplicity and stability. By using uncorrelated principal components instead of the original multicollinear predictors, we’ve created a more robust model that should generalize better to new data.

5.10.5 Interpretation of PCA Results

The PCA transformation has effectively addressed the multicollinearity issue in our data while maintaining predictive performance. Our analysis reveals:

  1. Variance Explained: The first 7 principal components capture approximately 95% of the variance in the original data, allowing us to reduce dimensionality while retaining most of the information.
  2. Model Performance: The PCA-based model achieved an accuracy of 67.18%, which is comparable to the original model’s accuracy of 72.59%.
  3. Component Interpretation:
    • PC1 primarily represents overall offensive production and career longevity
    • PC2 separates career experience from current season performance
    • Later components capture more nuanced aspects of player performance
    • This provides some interpretability despite the abstraction of principal components
  4. Dimensionality Reduction: We reduced from 16 original predictors to 7 principal components while maintaining similar predictive performance, a significant simplification.
  5. Statistical Stability: By eliminating multicollinearity, the PCA model’s coefficients are more stable and reliable, making it better suited for inference and prediction with new data.

Using PCA has successfully addressed the multicollinearity concern while maintaining predictive performance, providing an alternative modeling approach that complements our original model.

5.10.6 Interpretation of Model Results

The multinomial logistic regression models (both original and PCA-based) predict the probability of a player belonging to different salary levels based on various performance metrics. The results provide several valuable insights:

5.10.7 Key Significant Predictors (Original Model):

  1. Years in League: With very low p-values for both Medium and High salary categories, experience appears to be a significant predictor of salary level. The positive coefficients indicate that as experience increases, the likelihood of being in a higher salary tier also increases.
  2. Career Metrics: Variables like CHits (career hits) and CRBI (career RBIs) have lower p-values, suggesting career performance is more predictive of salary than single-season statistics.
  3. League and Division: There are some differences between leagues and divisions, with players in certain categories having higher probabilities of being in higher salary tiers.

5.10.8 PCA Model Insights:

In the PCA model, the first few principal components are particularly important in predicting salary levels:

  1. PC1 (Overall Offensive Production): This component, representing general offensive ability and career longevity, strongly predicts higher salary levels.
  2. PC2 (Experience vs. Current Performance): This component helps distinguish between players with similar overall statistics but different experience levels.
  3. Higher-Order Components: These capture more subtle variations in player statistics and help improve the model’s accuracy by accounting for specific combinations of statistics not captured by the first few components.

5.10.9 Model Performance Assessment:

Both models achieved reasonable predictive performance:

  • Original Model: Accuracy of 72.59%, with strongest performance in the LowSalary category.
  • PCA Model: Accuracy of 67.18%, with similar performance across categories and reduced risk of overfitting.
  • Precision and Recall Trade-offs: Both models show a trade-off between precision and recall, with some categories being easier to identify correctly than others.

5.10.10 Limitations:

  1. Category Boundaries: The boundaries between salary categories are somewhat arbitrary and may affect model performance.
  2. Non-Performance Factors: Our models don’t account for non-statistical factors that influence salaries, such as market size, team budget, or player popularity.
  3. Temporal Effects: Baseball salaries have changed over time, and our models don’t account for these temporal trends.
  4. Sample Size: The relatively small dataset may limit the generalizability of our findings.

In conclusion, both models provide valuable insights into the factors influencing baseball player salaries, with the PCA approach offering a more stable and parsimonious solution that addresses multicollinearity while maintaining predictive power.

6 Model Evaluation via Cross-Validation

To provide a more robust evaluation of our model’s predictive performance, we’ll use cross-validation methods to estimate the generalization error. Cross-validation helps us understand how well our model will perform on unseen data by repeatedly splitting the data into training and testing sets.

set.seed(06052004)

# Define function to run cross-validation and collect results
run_cross_validation <- function(formula, data, method = "cv", number = 5, 
                                model_type = "multinom", model_name = "Model") {
  # Set up cross-validation control
  ctrl <- trainControl(
    method = method,
    number = number,
    classProbs = TRUE,
    summaryFunction = multiClassSummary
  )
  
  # Perform cross-validation
  cv_model <- train(
    formula,
    data = data,
    method = model_type,
    trControl = ctrl,
    trace = FALSE
  )
  
  # Extract results
  best_result <- cv_model$results[which.max(cv_model$results$Accuracy),]
  
  # Calculate error rate
  error_rate <- 1 - best_result$Accuracy
  
  # Return results in a list
  return(list(
    model = cv_model,
    accuracy = best_result$Accuracy,
    error_rate = error_rate,
    # multiClassSummary() reports Mean_Balanced_Accuracy separately from AUC
    # (the mean one-vs-all AUC); extract the balanced-accuracy column here
    balanced_accuracy = best_result$Mean_Balanced_Accuracy,
    kappa = best_result$Kappa,
    model_name = model_name,
    cv_method = paste0(number, "-fold CV")
  ))
}
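
Under the hood, k-fold cross-validation simply partitions the rows into k disjoint folds and rotates which fold is held out. caret’s trainControl() handles this internally; a minimal base-R sketch of the fold assignment (using the Hitters row count for illustration) looks like:

```r
# Minimal base-R sketch of how k-fold cross-validation partitions the rows;
# caret's trainControl() does this internally, so this is illustration only.
k_fold_indices <- function(n, k, seed = 1) {
  set.seed(seed)
  # Shuffle row indices, then deal them into k roughly equal folds
  split(sample(n), rep(seq_len(k), length.out = n))
}

folds <- k_fold_indices(n = 317, k = 5)  # 317 players, as in the Hitters data
lengths(folds)                           # fold sizes of roughly 317 / 5 each
```

Each fold then serves once as the held-out test set while the remaining folds train the model, and the k accuracy estimates are averaged.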

6.1 5-Fold Cross-Validation for Full Model

# Run 5-fold cross-validation on the full model
cv5_full <- run_cross_validation(
  formula = SalaryLevel ~ . -Salary,
  data = hitters,
  number = 5,
  model_name = "Full Model"
)
5-Fold Cross-Validation Results for Full Model
Metric Value
Accuracy 63.67%
Error Rate 36.33%
Balanced Accuracy 82.41%
Kappa 0.452

6.2 10-Fold Cross-Validation for Full Model

# Run 10-fold cross-validation on the full model
cv10_full <- run_cross_validation(
  formula = SalaryLevel ~ . -Salary,
  data = hitters,
  number = 10,
  model_name = "Full Model"
)
10-Fold Cross-Validation Results for Full Model
Metric Value
Accuracy 64.97%
Error Rate 35.03%
Balanced Accuracy 84.00%
Kappa 0.4725

6.3 Cross-Validation for PCA Model

# Run 5-fold cross-validation on the PCA model
cv5_pca <- run_cross_validation(
  formula = SalaryLevel ~ .,
  data = hitters_pca,
  number = 5,
  model_name = "PCA Model"
)
5-Fold Cross-Validation Results for PCA Model
Metric Value
Accuracy 65.22%
Error Rate 34.78%
Balanced Accuracy 83.95%
Kappa 0.4745
# Run 10-fold cross-validation on the PCA model for consistency
cv10_pca <- run_cross_validation(
  formula = SalaryLevel ~ .,
  data = hitters_pca,
  number = 10,
  model_name = "PCA Model"
)
10-Fold Cross-Validation Results for PCA Model
Metric Value
Accuracy 66.25%
Error Rate 33.75%
Balanced Accuracy 83.58%
Kappa 0.4898

6.4 Comparison of Cross-Validation Results

Comparison of Cross-Validation Results Across Models and Methods

  Model        CV Method    Accuracy   Balanced Accuracy   Kappa
  Full Model   5-fold CV     63.67%         82.41%         0.4520
  Full Model   10-fold CV    64.97%         84.00%         0.4725
  PCA Model    5-fold CV     65.22%         83.95%         0.4745
  PCA Model    10-fold CV    66.25%         83.58%         0.4898

Table Interpretation: This table provides a comprehensive comparison of our model performance across different cross-validation methods. The metrics show consistent performance between 5-fold and 10-fold cross-validation, indicating stable model behavior. The PCA model performs similarly to the full model in terms of accuracy, suggesting that dimensionality reduction effectively preserved the predictive information while addressing multicollinearity.

Figure Interpretation: This visualization compares the performance metrics (Accuracy, Balanced Accuracy, and Kappa) across different models and cross-validation methods. The bar chart clearly shows that both models perform similarly across metrics, with consistent results between 5-fold and 10-fold validation approaches. The PCA model demonstrates comparable performance to the full model, indicating successful dimensionality reduction without sacrificing predictive power.

6.5 Discussion of Cross-Validation Results

The cross-validation results provide several key insights about our models:

  1. Consistency Across Folds:
    • Both 5-fold and 10-fold cross-validation yield similar results for each model type
    • For the full model, 5-fold CV accuracy is 63.67% while 10-fold CV accuracy is 64.97%
    • For the PCA model, 5-fold CV accuracy is 65.22% and 10-fold CV accuracy is 66.25%
    • This consistency across fold counts suggests our models have stable performance regardless of validation scheme
    • The minimal differences between 5-fold and 10-fold results indicate that either approach is sufficient for this dataset
  2. Model Comparison:
    • The PCA model shows better cross-validated accuracy than the full model
    • With 10-fold CV, the PCA model achieves 66.25% accuracy compared to the full model’s 64.97%
    • This demonstrates that dimensionality reduction through PCA can maintain or even slightly improve predictive performance while addressing multicollinearity
  3. Accuracy vs. Balanced Accuracy:
    • The balanced accuracy values (82–84%) are consistent across all model and fold combinations
    • Balanced accuracy averages per-class sensitivity and specificity, so it sits on a different scale than raw accuracy and the two should not be compared directly
    • The stability of this metric indicates that no single salary level is predicted substantially better or worse than the others
    • This balanced performance is expected since we created equal-sized categories using tertiles
  4. Kappa Statistic:
    • Kappa values for our models range from 0.452 to 0.4898
    • These values indicate moderate agreement between predicted and actual classes beyond what would be expected by chance
    • The PCA model with 10-fold CV achieved a kappa of 0.4898, which is higher than the full model’s kappa of 0.4725

6.5.1 Choice of Cross-Validation Methods

We selected these cross-validation methods for the following reasons:

  1. 5-Fold CV: This provides a good balance between bias and variance in the error estimate. Each fold contains approximately 20% of the data, providing reasonable test set sizes.
  2. 10-Fold CV: This is often recommended as a standard approach, as it typically provides lower bias than 5-fold CV but may have higher variance. The comparison allows us to assess the stability of our models.

For our final model evaluations and comparisons, we’ve chosen to focus primarily on the 10-fold CV results, as they typically provide a slightly more reliable estimate of model performance with less bias than 5-fold CV. The consistent results between 5-fold and 10-fold CV give us confidence in the stability of our models.

6.5.2 Implications for Model Selection

The cross-validation results suggest that:

  1. Our multinomial logistic regression models achieve moderate predictive accuracy (around 66.2% with 10-fold CV)
  2. Performance is remarkably consistent across different validation schemes
  3. No single class (salary level) is being predicted substantially better or worse than others
  4. The PCA approach effectively addresses multicollinearity while maintaining or slightly improving predictive performance

We will evaluate a stepwise selection model next and compare its performance to both the full model and the PCA model to determine the best approach for our final model. The consistency between 5-fold and 10-fold cross-validation results gives us confidence in the reliability of our model evaluations.

7 Model Improvement: Predictor Selection

7.1 The Role of Predictor Selection in Statistical Modeling

Feature selection is a critical step in building efficient and interpretable statistical models, especially when dealing with a large number of potentially correlated predictors. In our baseball salary prediction context, selecting an optimal subset of predictors offers several important benefits:

  1. Addressing Multicollinearity: As demonstrated in our earlier VIF analysis, many baseball statistics are highly correlated (e.g., hits and RBIs). Feature selection helps identify and retain only the most informative variables, reducing redundancy and improving model stability.
  2. Enhancing Interpretability: A model with fewer predictors is easier to interpret and explain. This is particularly valuable in sports analytics where insights need to be communicated to non-technical stakeholders like coaches, players, and team management.
  3. Improving Generalization: Models with fewer parameters are less prone to overfitting, which occurs when a model captures noise rather than underlying patterns in the data. This typically results in better performance on new, unseen data.
  4. Computational Efficiency: Reducing the number of parameters decreases computational requirements, which becomes increasingly important with larger datasets or more complex modeling techniques.

7.2 Stepwise Selection Approach

For our analysis, we’ll employ stepwise selection, a systematic approach that iteratively adds or removes variables from the model based on statistical criteria. We’ll use the bidirectional method (“both”), which combines:

  • Forward selection: Starts with no variables and adds them one by one
  • Backward elimination: Starts with all variables and removes them one by one

The algorithm will use the Akaike Information Criterion (AIC) to guide variable selection. AIC balances model fit and complexity by penalizing models with more parameters:

\[ \text{AIC} = 2k - 2\ln(L) \]

Where:

  • \(k\) is the number of parameters
  • \(L\) is the maximum value of the likelihood function

A lower AIC indicates a better model. The stepwise procedure finds the model with the lowest AIC, effectively identifying the subset of predictors that provides the best balance between:

  • Maximizing predictive performance (likelihood)
  • Minimizing model complexity (number of parameters)

While AIC is our primary criterion, it’s worth noting that alternative criteria such as the Bayesian Information Criterion (BIC) could also be used. BIC applies a stronger penalty for model complexity:

\[ \text{BIC} = k\ln(n) - 2\ln(L) \]

Where \(n\) is the sample size. BIC typically results in more parsimonious models than AIC.
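
Both criteria are available directly in base R. The sketch below compares two nested linear models on `mtcars` (illustrative only, not the salary model) and reconstructs AIC from its definition:

```r
# Illustrative comparison of AIC and BIC on two nested linear models fitted to
# base R's mtcars data (not the salary model itself).
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)

c(AIC_small = AIC(fit_small), AIC_large = AIC(fit_large))  # lower is better
c(BIC_small = BIC(fit_small), BIC_large = BIC(fit_large))  # heavier penalty: k * log(n)

# AIC from first principles: 2k - 2*ln(L), where k counts estimated parameters
k <- attr(logLik(fit_small), "df")
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit_small))
```

Because log(n) exceeds 2 for any sample larger than about 7 observations, BIC penalizes each extra parameter more heavily than AIC, which is why it tends toward more parsimonious models.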

Let’s proceed with implementing stepwise selection on our multinomial logistic regression model:

# Use stepwise selection based on AIC (stepAIC() comes from the MASS package)
set.seed(06052004)

model_step <- stepAIC(model_full, direction = "both", trace = FALSE)

# Extract selected variables
selected_vars <- attr(terms(model_step), "term.labels")

Variables selected by stepwise procedure:

12 out of 19 original variables


The selected variables are:

  • AtBat
  • HmRun
  • Runs
  • RBI
  • Walks
  • Years
  • CHits
  • CWalks
  • Division
  • PutOuts
  • Assists
  • Errors

The stepwise selection process evaluates numerous model combinations to identify the most informative subset of predictors. The result is a more parsimonious model that aims to maintain predictive power while reducing complexity. The formula of this model represents the mathematical relationship between our response variable (SalaryLevel) and the selected predictors, with separate coefficients for each salary level comparison (Medium vs. Low and High vs. Low).

7.3 Analyzing the Selected Model

The visualization above presents the statistically significant predictors for each salary level comparison. Blue tiles indicate variables that significantly influence salary levels (p < 0.05). Variables that appear in blue for both comparisons are key factors consistently affecting salaries across different levels. The significance level of each predictor is shown with stars (*** p<0.001, ** p<0.01, * p<0.05), giving you a clearer picture of which variables have the strongest statistical evidence of influence on player salaries.

7.4 Comparing Model Complexity

Model Complexity Comparison

  Model            Parameters   Reduction
  Full Model           40       Reference
  Stepwise Model       26       14 (35%)

Table Interpretation: This table quantifies the reduction in model complexity achieved through stepwise selection. The Stepwise Model uses significantly fewer parameters than the Full Model, offering a more parsimonious representation while still capturing the key relationships between predictors and salary levels.

Figure Interpretation: This visualization illustrates the substantial reduction in model complexity achieved through stepwise selection. By identifying and retaining only the most informative predictors, we’ve created a more streamlined model with fewer parameters, which promotes both interpretability and generalization ability.
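
The parameter counts above follow from the structure of a three-level multinomial logit: (levels − 1) linear predictors, each with an intercept plus one coefficient per variable. A quick base-R check:

```r
# Parameter bookkeeping for a 3-level multinomial logit: (levels - 1) linear
# predictors, each carrying an intercept plus one coefficient per variable.
n_levels <- 3
n_params <- function(p) (n_levels - 1) * (p + 1)

full_p     <- n_params(19)          # 19 predictors -> 40 parameters
stepwise_p <- n_params(12)          # 12 predictors -> 26 parameters
reduction  <- full_p - stepwise_p   # 14 fewer parameters (a 35% reduction)
```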

Types of Variables in the Selected Model

  Variable Type               Count   Details
  Current Season Statistics     8     AtBat, HmRun, Runs, RBI, Walks, PutOuts, Assists, Errors
  Career Statistics             2     CHits, CWalks
  Experience                    1     Years in league
  Categorical Variables         1     Division

Variable Type Analysis: The table above categorizes the selected predictors based on their nature. We observe a balanced mix of current season performance metrics, career statistics, experience indicators, and categorical variables in our final model. This suggests that salary determination in MLB considers multiple dimensions of a player’s profile, including both recent performance and career achievements. The specific variables within each category provide insight into which aspects of performance are most relevant for salary prediction in professional baseball.

7.5 Odds Ratio Interpretation

Understanding the practical impact of predictors in multinomial logistic regression requires interpreting the coefficients in terms of odds ratios. In our context, odds ratios reveal how different baseball statistics influence the probability of a player belonging to different salary categories.

7.5.1 What Are Odds Ratios?

Odds ratios represent the multiplicative change in the odds of being in a specific salary category (relative to the reference category) for a one-unit increase in a predictor. For example:

  • An odds ratio of 1.5 means a one-unit increase in the predictor multiplies the odds by 1.5 (50% increase)
  • An odds ratio of 0.7 means a one-unit increase in the predictor multiplies the odds by 0.7 (30% decrease)
  • An odds ratio of 1.0 means the predictor has no effect on the odds

For categorical predictors like League or Division, the odds ratio compares the odds for one category (e.g., National League) to the reference category (e.g., American League).
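
The arithmetic behind these readings is just exponentiation of the model’s log-odds coefficients. A tiny sketch with hypothetical coefficient values (not taken from the fitted model):

```r
# Hypothetical log-odds coefficients; exp() converts them to odds ratios.
beta <- 0.405
odds_ratio <- exp(beta)                # about 1.5: odds multiply by ~1.5 per unit increase
pct_change <- (odds_ratio - 1) * 100   # about a 50% increase in the odds

# A negative coefficient works the same way in reverse
exp(-0.357)                            # about 0.7: odds shrink by ~30% per unit increase
```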

7.5.2 Methodology for Interpretation

To interpret our model:

  1. We extract the coefficients from the multinomial logistic regression model
  2. Exponentiate these coefficients to obtain odds ratios
  3. Calculate p-values to determine statistical significance
  4. Identify the strongest predictors for each salary level
  5. Visualize and interpret the most influential factors

Here are the top predictors for each salary level based on their odds ratios:

Low Salary Predictors: The top graph shows variables that increase the odds of a player being in the low salary category. These are represented as inverse odds ratios from the high salary model, indicating factors that decrease the odds of earning high salaries. Variables like fewer career hits, lower batting metrics, and fewer years of experience significantly increase the likelihood of being in the low salary tier.

Medium Salary Predictors: The middle graph displays factors that most strongly predict medium salary levels (compared to low). Career statistics and current season performance metrics like runs and home runs show the strongest positive associations with medium salary levels.

High Salary Predictors: The bottom graph reveals the most influential factors for high salaries. Long-term career metrics and experience emerge as dominant predictors, suggesting that sustained performance over time is particularly rewarded at the highest salary levels.

7.5.3 Key Odds Ratio Interpretations

# Create a table for key odds ratio interpretations
key_interpretations <- data.frame(
  Variable = character(),
  Medium_OR = character(),
  High_OR = character(),
  Interpretation = character(),
  stringsAsFactors = FALSE
)

# Add Years interpretation if present
if ("Years" %in% colnames(odds_ratios)) {
  key_interpretations <- rbind(key_interpretations, data.frame(
    Variable = "Years in league",
    Medium_OR = sprintf("%.2f", odds_ratios["MediumSalary", "Years"]),
    High_OR = sprintf("%.2f", odds_ratios["HighSalary", "Years"]),
    Interpretation = paste0("Each additional year of experience multiplies the odds of medium salary by ", 
                          sprintf("%.2f", odds_ratios["MediumSalary", "Years"]), 
                          " and high salary by ", 
                          sprintf("%.2f", odds_ratios["HighSalary", "Years"]), 
                          ". Experience strongly influences compensation.")
  ))
}

# Add Career Home Runs interpretation if present
if ("CHmRun" %in% colnames(odds_ratios)) {
  key_interpretations <- rbind(key_interpretations, data.frame(
    Variable = "Career Home Runs",
    Medium_OR = sprintf("%.3f", odds_ratios["MediumSalary", "CHmRun"]),
    High_OR = sprintf("%.3f", odds_ratios["HighSalary", "CHmRun"]),
    Interpretation = paste0("Each additional career home run multiplies the odds of medium salary by ", 
                          sprintf("%.3f", odds_ratios["MediumSalary", "CHmRun"]), 
                          " and high salary by ", 
                          sprintf("%.3f", odds_ratios["HighSalary", "CHmRun"]), 
                          ". Power hitting is valued in compensation decisions.")
  ))
}

# Add League interpretation if present
if ("LeagueN" %in% colnames(odds_ratios)) {
  key_interpretations <- rbind(key_interpretations, data.frame(
    Variable = "League (National vs American)",
    Medium_OR = sprintf("%.2f", odds_ratios["MediumSalary", "LeagueN"]),
    High_OR = sprintf("%.2f", odds_ratios["HighSalary", "LeagueN"]),
    Interpretation = paste0("Playing in the National League (vs American) multiplies the odds of medium salary by ", 
                          sprintf("%.2f", odds_ratios["MediumSalary", "LeagueN"]), 
                          " and high salary by ", 
                          sprintf("%.2f", odds_ratios["HighSalary", "LeagueN"]), 
                          ". ", ifelse(odds_ratios["HighSalary", "LeagueN"] > 1, 
                                     "National League players tend to earn more, controlling for other factors.", 
                                     "American League players tend to earn more, controlling for other factors."))
  ))
}

# Add another significant variable if available
if (length(significant_high) >= 4) {
  var_to_add <- significant_high[4]
  if (!(var_to_add %in% c("Years", "CHmRun", "LeagueN")) && var_to_add %in% colnames(odds_ratios)) {
    key_interpretations <- rbind(key_interpretations, data.frame(
      Variable = var_to_add,
      Medium_OR = sprintf("%.3f", odds_ratios["MediumSalary", var_to_add]),
      High_OR = sprintf("%.3f", odds_ratios["HighSalary", var_to_add]),
      Interpretation = paste0("Each additional unit of ", var_to_add, " multiplies the odds of medium salary by ", 
                            sprintf("%.3f", odds_ratios["MediumSalary", var_to_add]), 
                            " and high salary by ", 
                            sprintf("%.3f", odds_ratios["HighSalary", var_to_add]), ".")
    ))
  }
}

# Display the interpretations table
knitr::kable(key_interpretations, 
             caption = "Key Odds Ratio Interpretations",
             col.names = c("Variable", "Medium Salary OR", "High Salary OR", "Interpretation"),
             align = c("l", "c", "c", "l"),
             format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                          full_width = TRUE,
                          position = "center") %>%
  kableExtra::row_spec(0, background = "#f8f9fa", bold = TRUE) %>%
  kableExtra::column_spec(4, width = "50%")
Key Odds Ratio Interpretations
Variable Medium Salary OR High Salary OR Interpretation
Years in league 1.21 0.75 Each additional year of experience multiplies the odds of medium salary by 1.21 and high salary by 0.75. Experience strongly influences compensation.
DivisionW 1.393 0.365 Each additional unit of DivisionW multiplies the odds of medium salary by 1.393 and high salary by 0.365.

Additional Notes on Odds Ratio Interpretation:

  • Comparative Impact: Odds ratios allow us to compare the relative influence of different predictors, even when they are measured on different scales.
  • Contextual Interpretation: An odds ratio of 1.10 means a 10% increase in odds for each unit increase in the predictor. However, the practical significance depends on the variable’s range and typical increments.
  • Limitations: Odds ratios don’t directly translate to probability changes. A given odds ratio can correspond to different probability changes depending on the starting probability.
  • Reference Category: In our analysis, low salary is the reference category. Odds ratios for medium and high salary are relative to this base level.
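The limitation noted above can be checked directly: the same odds ratio implies different probability changes at different baseline probabilities. A minimal sketch with illustrative numbers:

```r
# Convert a baseline probability to odds, apply an odds ratio, convert back,
# and return the change in probability
or_to_prob_change <- function(p0, or) {
  odds1 <- (p0 / (1 - p0)) * or
  odds1 / (1 + odds1) - p0
}

or_to_prob_change(0.10, 1.5)   # baseline 10%: probability rises by about 4.3 points
or_to_prob_change(0.50, 1.5)   # baseline 50%: probability rises by 10 points
```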

7.6 Cross-Validation of the Selected Model

To properly evaluate our stepwise model’s performance and generalization ability, we’ll apply the same cross-validation approach we used for our previous models.

Cross-Validation Results for Stepwise Model
Metric 5-Fold CV 10-Fold CV
Accuracy 66.39% 67.62%
Error Rate 33.61% 32.38%
Balanced Accuracy 85.20% 85.26%
Kappa 0.4928 0.5119
In-sample Evaluation for Stepwise Model
Metric Value
Accuracy 73.36%
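The cross-validation scheme behind these numbers can be sketched as below. The function and object names are assumptions; the report's actual helper function may differ in detail:

```r
library(nnet)
library(caret)   # for createFolds()

# Minimal k-fold CV sketch for a multinomial model
# (`data` is assumed to contain the SalaryLevel response)
cv_multinom <- function(formula, data, k = 10, seed = 123) {
  set.seed(seed)
  folds <- caret::createFolds(data$SalaryLevel, k = k)
  preds <- character(nrow(data))
  for (idx in folds) {
    # Fit on the k-1 training folds, predict on the held-out fold
    fit <- nnet::multinom(formula, data = data[-idx, ], trace = FALSE)
    preds[idx] <- as.character(predict(fit, newdata = data[idx, ]))
  }
  mean(preds == as.character(data$SalaryLevel))   # CV accuracy
}
```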

7.6.1 Understanding the Confusion Matrix

The confusion matrix provides a detailed breakdown of our model’s predictive performance by showing how instances from each actual class are distributed across predicted classes. This visualization reveals important patterns in our model’s classification behavior:

  1. Diagonal Elements (Correct Classifications):
    • The cells along the diagonal (top-left to bottom-right) represent correctly classified instances
    • Darker colors indicate higher numbers of correct classifications
    • Ideally, we want the majority of instances to fall along this diagonal
  2. Off-Diagonal Elements (Misclassifications):
    • These cells represent errors where the predicted class differs from the actual class
    • For example, the top-middle cell shows low salary players misclassified as medium salary
    • The pattern of misclassifications reveals systematic errors in the model
  3. Key Insights from Our Matrix:
    • Most misclassifications occur between adjacent categories (Low↔Medium or Medium↔High)
    • Very few instances are misclassified by two levels (Low→High or High→Low)
    • Medium salary players show the most misclassifications, suggesting this middle category has more ambiguous characteristics
  4. Performance Metrics Derivation:
    • Accuracy is calculated as the sum of the diagonal elements divided by the total number of instances
    • Class-specific metrics like precision and recall can be derived from the rows and columns
    • The overall pattern helps identify which classes the model handles well or poorly

The confusion matrix complements our numeric performance metrics by visualizing the specific types of errors made by the model, providing deeper insight into its strengths and limitations.
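The metric derivations in point 4 can be made concrete. The diagonal and row totals below match the per-level counts reported later in this section; the split of each row's errors across the off-diagonal cells is illustrative:

```r
# 3x3 confusion matrix (rows = actual, cols = predicted)
# Diagonal and row totals follow the reported per-level counts;
# off-diagonal splits are assumed for illustration
cm <- matrix(c(80, 10,  2,
               12, 43, 26,
                3, 16, 67),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("Low", "Medium", "High"),
                             predicted = c("Low", "Medium", "High")))

accuracy  <- sum(diag(cm)) / sum(cm)    # diagonal over total: 190/259 ≈ 0.7336
recall    <- diag(cm) / rowSums(cm)     # per-class: correct / actual count (rows)
precision <- diag(cm) / colSums(cm)     # per-class: correct / predicted count (columns)
```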

7.7 Comprehensive Model Comparison

Now that we have evaluated all three modeling approaches (Full Model, PCA Model, and Stepwise Model), we can perform a comprehensive comparison to identify the most effective approach for predicting MLB player salary levels.

Comprehensive Comparison of All Models
Model CV Method Accuracy Balanced Accuracy Kappa Parameters
Full Model
Full Model 5-Fold CV 63.67% 82.41% 0.4520 40
Full Model 10-Fold CV 64.97% 84.00% 0.4725 40
PCA Model
PCA Model 5-Fold CV 65.22% 83.95% 0.4745 7
PCA Model 10-Fold CV 66.25% 83.58% 0.4898 7
Stepwise Model
Stepwise Model 5-Fold CV 66.39% 85.20% 0.4928 26
Stepwise Model 10-Fold CV 67.62% 85.26% 0.5119 26

Comprehensive Model Comparison: This table presents a side-by-side comparison of all three modeling approaches across different cross-validation schemes. For each model, we report multiple performance metrics (Accuracy, Balanced Accuracy, and Kappa) as well as the number of parameters to represent model complexity. The 10-fold CV results generally provide the most reliable estimate of out-of-sample performance, while the parameter count reflects model simplicity and interpretability.

Accuracy vs. Complexity Visualization: This scatter plot positions each model according to its complexity (x-axis) and cross-validation accuracy (y-axis). The ideal model would appear in the top-left corner, representing high accuracy with low complexity. The plot visually demonstrates the trade-offs between model complexity and predictive performance across our three approaches. Points with the same color represent the same model evaluated with different cross-validation methods. The consistent results between 5-fold and 10-fold CV for each model suggest stable performance.

7.7.1 Model Selection Analysis

Based on our comprehensive evaluation, we can now determine which model offers the best balance of predictive performance, interpretability, and computational efficiency.

Best Performing Model Based on 10-Fold Cross-Validation
Best Model 10-Fold CV Accuracy Parameters Key Advantages
Stepwise Model 67.62% 26 Balanced performance and interpretability; uses original variables

The Stepwise Model emerges as our best-performing model based on 10-fold cross-validation accuracy. This conclusion is derived from:

  1. Systematic Evaluation Process:
    • We implemented identical cross-validation procedures for all three models
    • We used both 5-fold and 10-fold CV to ensure robust evaluation
    • We focused on 10-fold CV for final comparison as it typically provides a more reliable estimate
    • We evaluated multiple performance metrics including accuracy, balanced accuracy, and kappa
  2. Performance Metrics Analysis:
    • The Stepwise Model achieved the highest CV accuracy at 67.62%
    • The accuracy differences between models were modest but consistent across both validation schemes, pointing to a real, if small, advantage for the stepwise approach
    • The balanced accuracy and kappa statistics show similar patterns, confirming the consistency of our findings
  3. Complexity Considerations:
    • The Stepwise Model uses 26 parameters
    • This represents a reduction of 35% compared to the full model

This model selection process demonstrates the importance of balancing multiple criteria when choosing a final model, including predictive performance, complexity, and interpretability.

7.7.2 In-Sample vs. Cross-Validation Performance Analysis

A critical aspect of model evaluation is comparing in-sample performance (how well the model fits the training data) with cross-validation performance (how well it generalizes to new data). A large gap between these metrics can indicate overfitting, where the model captures noise rather than underlying patterns.

In this analysis, we’ll:

  1. Compare in-sample accuracy (on the full dataset) with cross-validation accuracy
  2. Calculate the “generalization gap” (difference between in-sample and CV accuracy)
  3. Evaluate which model shows the best generalization properties
  4. Determine the most appropriate model for practical application

In-Sample vs. Cross-Validation Accuracy Comparison (using 10-fold CV)
Model In-Sample Accuracy CV Accuracy Generalization Gap
Full Model 72.59% 64.97% 7.62%
PCA Model 67.18% 66.25% 0.93%
Stepwise Model 73.36% 67.62% 5.74%

Generalization Gap Analysis: This table shows the accuracy of each model on the training data (In-Sample) compared to its performance in cross-validation (CV). The difference between these values, known as the generalization gap, indicates how well the model will perform on new, unseen data. Smaller gaps suggest better generalization ability. The model with the smallest gap (highlighted) demonstrates the most consistent performance between training and validation.
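The gap column can be reproduced directly from the accuracies reported above (values copied from the comparison table):

```r
# Generalization gap = in-sample accuracy - 10-fold CV accuracy
# (accuracies taken from the comparison table)
in_sample <- c(Full = 0.7259, PCA = 0.6718, Stepwise = 0.7336)
cv_acc    <- c(Full = 0.6497, PCA = 0.6625, Stepwise = 0.6762)

gap <- in_sample - cv_acc
round(100 * gap, 2)   # Full 7.62, PCA 0.93, Stepwise 5.74 (percentage points)
```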

Visualization of Model Generalization: These plots illustrate the relationship between in-sample and cross-validation performance for each model:

  • The top plot directly compares in-sample and cross-validation accuracy for each model. Smaller differences between the paired bars indicate better generalization.
  • The bottom plot focuses specifically on the generalization gap (the difference between in-sample and CV accuracy). Smaller bars indicate models with better generalization properties that are less likely to overfit the training data.

These visualizations help identify models that not only perform well but also demonstrate reliable generalization to new data - a crucial property for practical applications.

Model Selection Analysis
Criterion Model Value
Best Generalization (smallest gap) PCA Model Gap: 0.9%
Best CV Performance Stepwise Model Accuracy: 67.6%
Recommended Model Stepwise Model or PCA Model Trade-off between performance and generalization

Based on our comprehensive analysis of both performance and generalization properties, we can make the following observations:

  1. Generalization Ability: The PCA Model demonstrates the smallest generalization gap at 0.9%, indicating it is the most reliable model when applied to new data.
  2. Predictive Performance: The Stepwise Model achieves the highest cross-validation accuracy at 67.6%, making it the top performer in terms of raw predictive power.
  3. Optimal Model Choice: There is a trade-off between the Stepwise Model (highest accuracy) and the PCA Model (best generalization). The choice depends on whether performance or reliability is prioritized for the specific application.
  4. Performance-Complexity Balance: When considering both predictive accuracy and model complexity, the Stepwise Model offers an excellent compromise, with substantially fewer parameters than the Full Model while maintaining competitive performance.

This detailed evaluation demonstrates the importance of considering multiple criteria when selecting a final model for practical application, rather than focusing solely on a single performance metric.

7.8 Analysis of Misclassified Cases

Understanding where our model makes errors is as valuable as knowing its overall accuracy. By analyzing the characteristics of misclassified players, we can gain insights into the model’s limitations and potential areas for improvement.

7.8.1 Understanding Misclassification Patterns

We’ll examine players that our stepwise model incorrectly classified to identify:

  • Which salary levels are most challenging to predict
  • Whether misclassifications tend to occur between adjacent categories
  • What player characteristics contribute to prediction errors
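Extracting the misclassified cases is straightforward; the object names `stepwise_model` and `hitters_model_data` below are assumptions standing in for the report's fitted model and modeling dataset:

```r
# Identify misclassified players from the stepwise model's in-sample predictions
# (`stepwise_model` and `hitters_model_data` are assumed object names)
pred   <- predict(stepwise_model, newdata = hitters_model_data)  # predicted class labels
actual <- hitters_model_data$SalaryLevel

misclassified <- hitters_model_data[pred != actual, ]
misclassified$Predicted <- pred[pred != actual]

# Cross-tabulate the error types (actual level vs predicted level)
table(actual = actual[pred != actual], predicted = misclassified$Predicted)
```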

Misclassification Pattern Analysis: This visualization shows the different types of classification errors made by our model, with the percentage of players from each actual salary category that were misclassified. The colors represent the type of error:

  • Orange tiles show errors where the model overestimated salaries (predicting higher than actual)
  • Green tiles show errors where the model underestimated salaries (predicting lower than actual)
  • Blue tiles show errors spanning two levels (Low→High or High→Low)

The most common pattern involves confusing adjacent salary categories, while errors spanning two levels are rare. This indicates our model generally understands the salary hierarchy, even when it makes errors.

Misclassification Rates by Salary Level
Salary Level Total Players Misclassified Players Misclassification Rate (%) Correct Classification Rate (%)
LowSalary 92 12 13.0 87.0
MediumSalary 81 38 46.9 53.1
HighSalary 86 19 22.1 77.9

Salary-Level Specific Error Rates: This table shows how classification accuracy varies across the three salary levels. The medium salary category shows the highest misclassification rate, suggesting this middle group has more ambiguous characteristics that overlap with both low and high salary players. This is a common challenge in multi-class classification where middle categories often have less distinct boundaries.

7.8.2 Visualizing Classification Performance by Salary Level

The following visualization shows the accuracy of our model for each salary level, highlighting the proportion of correctly and incorrectly classified players in each category:

Classification Performance Visualization: This chart illustrates the classification accuracy for each salary level. The colored portion of each bar represents the percentage of correctly classified players, while the gray portion shows misclassifications. The Medium salary category has the highest error rate, while the High salary category shows the best classification performance. This suggests our model is most effective at identifying the highest-paid players.

7.8.3 Analyzing Characteristics of Misclassified Players

To better understand the specific player profiles that challenge our model, we’ll examine key attributes of misclassified players and how they differ from those who were correctly classified:

Analysis of Misclassified Player Characteristics:

The left plot examines the relationship between experience (Years) and performance (Hits) for misclassified players. We can observe several patterns:

  • Players with high hits but low experience are often predicted to have higher salaries than their actual level
  • Players with extensive experience but moderate performance tend to be classified in higher salary categories than warranted
  • The misclassifications suggest that our model may not fully capture how teams value the interaction between experience and performance

The right plot shows the salary distribution of misclassified players by their actual and predicted categories. Key insights include:

  • Many misclassified players have salaries near the boundaries between categories
  • For Medium salary players predicted as High, their actual salaries tend to be in the upper range of the Medium category
  • Similarly, Low salary players predicted as Medium often have salaries at the higher end of the Low category
  • This suggests many misclassifications occur for “borderline” cases where players are near category thresholds

7.8.4 Comparing Key Metrics Between Correctly and Incorrectly Classified Players

To gain further insight into what distinguishes misclassified players, we’ll compare their average statistics with those of correctly classified players:

Comparison of Key Metrics Between Correctly and Incorrectly Classified Players
Variable Misclassified Players (Mean) Correctly Classified Players (Mean) Percentage Difference (%)
Years 8.3 6.9 20.8
Hits 100.7 110.6 -9.0
HmRun 11.6 11.6 0.7
RBI 49.6 51.9 -4.4
Walks 36.4 42.3 -14.0
Salary 506.6 542.3 -6.6

Metric Differences Visualization: This chart illustrates the percentage differences in key statistics between misclassified and correctly classified players. The most substantial differences are in:

  • Years: Misclassified players average roughly 21% more experience than correctly classified players, suggesting the model may overweight seniority for veterans whose current production no longer matches their tenure
  • Walks and Hits: Misclassified players average about 14% fewer walks and 9% fewer hits, indicating that these contact and plate-discipline metrics may relate to salary in ways the model does not fully capture
  • Salary: The actual salary difference indicates that misclassified players often have salaries that differ from the typical values in their assigned categories

These findings reveal specific areas where our predictive model could be enhanced, potentially by incorporating interaction terms or allowing for non-linear relationships between these key variables and salary levels.
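The group comparison itself can be computed as sketched below; `hitters_model_data` and the logical vector `misclassified_flag` (TRUE where the prediction differs from the actual class) are assumed names:

```r
# Compare mean statistics between misclassified and correctly classified players
# (`misclassified_flag` is an assumed logical: TRUE where prediction != actual)
vars <- c("Years", "Hits", "HmRun", "RBI", "Walks", "Salary")

means_by_group <- aggregate(hitters_model_data[, vars],
                            by = list(Misclassified = misclassified_flag),
                            FUN = mean)

# Percentage difference of misclassified (row 2, group TRUE) relative to
# correctly classified (row 1, group FALSE)
pct_diff <- 100 * (means_by_group[2, vars] / means_by_group[1, vars] - 1)
round(pct_diff, 1)
```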

7.9 Insights from Misclassification Analysis

The analysis of misclassified cases reveals several important patterns that help us understand the limitations of our model:

  1. Misclassification Types: The most common misclassification occurs between adjacent salary categories (Low→Medium or Medium→High), with very few cases where players are misclassified by two levels (Low→High or High→Low). This is reassuring as it suggests the model rarely makes egregious errors.
  2. Class-specific Error Rates: Medium salary players have the highest misclassification rate at 46.9%, suggesting this group may have more ambiguous characteristics that overlap with both low and high salary players.
  3. Borderline Cases: Many misclassified players appear to have salaries near the boundary between categories, as shown in the salary distribution plot. This indicates that some misclassifications are due to the somewhat arbitrary nature of the salary category boundaries.
  4. Experience vs. Performance: The Years vs. Hits plot indicates that misclassifications often involve players with either:
    • More experience but lower performance than expected for their salary level
    • Less experience but higher performance than expected for their salary level
  5. Key Variable Differences: On average, misclassified players differ from correctly classified ones primarily in Years (20.8% difference). This suggests that our model might be overemphasizing certain metrics at the expense of others.
  6. Special Cases: The analysis suggests there may be unique player circumstances not captured by our model, such as:
    • Players who recently signed new contracts that don’t yet reflect their performance
    • Star players who may be overpaid for marketing or historical reasons
    • Players who produce in ways not fully captured by traditional statistics

These insights suggest several opportunities for model improvement:

  1. Interaction Terms: Adding interaction terms between experience and performance metrics could better capture how these factors work together to determine salary.
  2. Weighted Recent Performance: Giving more weight to recent years’ performance might improve predictions.
  3. Additional Variables: If available, factors like player popularity, market size, or team revenue could enhance the model.
  4. Alternative Modeling Approaches: Machine learning techniques like random forests or boosting might better capture non-linear relationships in the data.
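As a sketch of the last suggestion, a random forest could be fit on the same categorical response. This is a hypothetical extension, not part of the report's analysis, and the data object name is assumed:

```r
library(randomForest)

# Hypothetical extension: random forest on the same categorical response
# (`hitters_model_data` and the formula are assumptions)
set.seed(42)
rf_fit <- randomForest(SalaryLevel ~ . - Salary, data = hitters_model_data,
                       ntree = 500, importance = TRUE)

rf_fit$confusion     # out-of-bag confusion matrix, an internal CV-like estimate
importance(rf_fit)   # variable importance (mean decrease in accuracy / Gini)
```

The out-of-bag error reported by `rf_fit` plays a role similar to cross-validation, so it could be compared directly against the multinomial models' CV accuracies.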

7.10 Explanation of Selection Process and Model Superiority

The stepwise selection procedure was chosen to identify a more parsimonious model that balances complexity and predictive performance. Here’s why this approach was beneficial:

  1. Model Complexity Reduction: The stepwise procedure reduced the number of parameters from 40 in the full model to 26 in the selected model - a 35% reduction. This simplification makes the model:

    • More interpretable, focusing on the most important variables
    • Less prone to overfitting
    • Computationally more efficient
  2. Addressing Multicollinearity: The VIF analysis showed significant multicollinearity among predictors. The stepwise procedure helped by:

    • Removing redundant variables that provide similar information
    • Selecting variables that offer unique explanatory power
    • Creating a more stable model less affected by correlation issues
  3. Performance Comparison: The cross-validation results demonstrate that:

    • The stepwise model achieves an accuracy of 67.62% (10-fold CV) compared to 64.97% for the full model and 66.25% for the PCA model
    • The stepwise model outperforms the full model despite using fewer parameters
    • The stepwise model outperforms the PCA model while maintaining the interpretability of original variables
  4. Generalization Assessment: The difference between in-sample and cross-validation accuracy is 5.74% for the stepwise model, compared to 7.62% for the full model and 0.93% for the PCA model. The PCA model shows the smallest generalization gap, with the stepwise model coming in second.

  5. Key Selected Predictors: The selected variables show a balanced mix of:

    • Current season performance metrics
    • Career statistics
    • Years of experience
    • League and division information

    This suggests the model has identified the most relevant factors affecting player salaries from multiple dimensions.
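The selection step itself can be sketched as below. The object names are assumptions, and `MASS::stepAIC()` is one common way to run a both-direction stepwise search on a `multinom` fit; the report's actual procedure may differ:

```r
library(nnet)
library(MASS)

# Fit the full multinomial model (object names are illustrative)
full_model <- nnet::multinom(SalaryLevel ~ ., data = hitters_model_data,
                             trace = FALSE)

# Both-direction stepwise search minimizing AIC
stepwise_model <- MASS::stepAIC(full_model, direction = "both", trace = FALSE)

length(coef(stepwise_model))   # total estimated coefficients in the reduced model
```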

7.10.1 Conclusion on Model Selection

Based on our comprehensive analysis, the Stepwise Model emerges as our recommended approach. This conclusion is supported by the following evidence:

  1. Predictive Performance: The Stepwise Model achieves the highest cross-validation accuracy of the three approaches at 67.62%.
  2. Model Parsimony: The Stepwise Model uses significantly fewer parameters than the full model, demonstrating the principle of achieving similar or better performance with a simpler model.
  3. Generalization Ability: The Stepwise Model shows strong generalization properties with a gap of 5.74% between in-sample and cross-validation accuracy.
  4. Interpretability: The stepwise model maintains the original variables, which are directly interpretable in baseball terms.

When compared to the alternatives: - The stepwise model offers a balanced approach, with good performance, reduced complexity, and interpretable variables - The PCA model effectively addresses multicollinearity but sacrifices some interpretability - The full model provides a baseline but includes potentially redundant information

This model selection process demonstrates the importance of considering multiple criteria—predictive performance, complexity, generalization, and interpretability—when choosing a final model for practical applications.

8 Conclusion

8.1 Summary of Analysis Workflow

We have successfully completed an extensive analysis of the MLB Hitters dataset, following a systematic approach:

  1. Data Import and Pre-Processing
    • Imported the Hitters dataset with proper error checking
    • Identified and handled missing values (18% of the data in the Salary column)
    • Detected and analyzed outliers, deciding to keep them as legitimate extreme values
    • Conducted exploratory data analysis with improved visualizations
    • Standardized numeric predictors while preserving the target variable
  2. Creating the Categorical Response Variable
    • Transformed the Salary variable into three balanced levels using tertiles
    • Compared statistical tertiles with MLB-specific salary benchmarks
    • Created a balanced categorical response variable (SalaryLevel)
    • Provided clear justification for the tertile approach
  3. Visual Data Inspection
    • Created boxplots, scatter plots, and density plots with consistent styling
    • Generated an enhanced correlation heatmap to identify multicollinearity
    • Visualized relationships between categorical and numeric variables
    • Added detailed interpretations of all visualizations
  4. Multinomial Logistic Regression Model
    • Created a reusable function for model evaluation
    • Fitted a full model with all predictors
    • Calculated and interpreted p-values with significance indicators
    • Created a visually improved confusion matrix
    • Identified and visualized multicollinearity issues
  5. Addressing Multicollinearity with PCA
    • Applied Principal Component Analysis to transform correlated predictors
    • Visualized variance explained by each component
    • Created an improved biplot for interpreting principal components
    • Built and evaluated a model using PCA-transformed data
  6. Model Evaluation via Cross-Validation
    • Created a function to standardize cross-validation procedures
    • Performed 5-fold and 10-fold cross-validation
    • Calculated misclassification rates and other metrics
    • Compared results across different validation schemes
    • Added clear visualizations comparing model performance
  7. Model Improvement: Predictor Selection
    • Used stepwise selection to identify a more parsimonious model
    • Evaluated the reduced model using cross-validation
    • Calculated and visualized odds ratios for better interpretation
    • Compared performance of full, PCA, and selected models
    • Analyzed misclassified cases to understand model limitations
  8. Conclusion
    • Confirmed the successful execution of all analysis steps
    • Summarized model complexity and performance metrics

8.2 Key Findings and Implications

Our analysis has demonstrated that MLB player salaries can be predicted with moderate accuracy using performance statistics. Here are the key findings:

  1. Predictive Performance: The stepwise model achieved a cross-validation accuracy between 66.4% (5-fold) and 67.6% (10-fold) with substantially fewer parameters than the full model, making it more interpretable and computationally efficient.

  2. Important Predictors: The most influential factors in determining player salary level are:

    • Years of experience in the league (positive relationship)
    • Career performance metrics (especially home runs and RBIs)
    • Current season performance (hits, runs, etc.)
    • League and division
  3. Multicollinearity: We identified and addressed significant multicollinearity among predictors through both PCA and feature selection approaches. Both methods maintained similar predictive performance while simplifying the model.

  4. Misclassification Patterns: The model struggles most with:

    • Players near category boundaries
    • Players with unusual combinations of experience and performance
    • Medium salary players, who show characteristics that overlap with both low and high salary categories
  5. Model Selection Trade-offs:

    • The full model provided baseline performance but suffered from multicollinearity
    • The PCA model effectively addressed multicollinearity but with less interpretable features
    • The stepwise model provided the best balance of interpretability, simplicity, and performance

Group E ~ 7 April 2025